# Data Quality Project
Analysis of Milan Personal Services - Database 12 <br>
Group Project Report in DATA INFORMATION AND QUALITY (2024-2025) <br>
Mauro Orazio Drago, Dennis Pierantozzi, Davide Morelli

## Data Analysis
We have decided to perform classification to detect the "Tipo esercizio" column.

In [61]:
SERVICES = pd.read_csv('/kaggle/input/servizi/Comune-di-Milano-Servizi-alla-persona-parrucchieri-estetisti.csv',sep=';',encoding='unicode_escape')
SERVICES.head()

  has_large_values = (abs_vals > 1e6).any()
  has_small_values = ((abs_vals < 10 ** (-self.digits)) & (abs_vals > 0)).any()
  has_small_values = ((abs_vals < 10 ** (-self.digits)) & (abs_vals > 0)).any()


Unnamed: 0,Tipo esercizio pa,Ubicazione,Tipo via,Via,Civico,Codice via,ZD,Prevalente,Superficie altri usi,Superficie lavorativa
0,,LGO DEI GELSOMINI N. 10 (z.d. 6),LGO,DEI GELSOMINI,10,5394.0,6,,,55.0
1,,PZA FIDIA N. 3 (z.d. 9),PZA,FIDIA,3,1144.0,9,CENTRO MASSAGGI RILASSANTI NON ESTETICI,2.0,28.0
2,,VIA ADIGE N. 10 (z.d. 5),VIA,ADIGE,10,4216.0,5,CENTRO BENESSERE,2.0,27.0
3,,VIA BARACCHINI FLAVIO N. 9 (z.d. 1),VIA,BARACCHINI FLAVIO,9,356.0,1,TRUCCO SEMIPERMANENTE,,
4,,VIA BERGAMO N. 12 (z.d. 4),VIA,BERGAMO,12,3189.0,4,,,50.0


Features we are going to use:
* Tipo esercizio pa
* Tipo via
* Via
* ZD
* Superficie altri usi
* Superficie lavorativa

In [63]:
SERVICES = SERVICES.drop(columns=["Civico", "Via", "Prevalente", "Ubicazione", "Codice via"])

## Encoding
* Tipo esercizio pa: encoding used LabelEncoder of sklearn
* Tipo via: one hot encoding
* ZD: one hot encoding

In [64]:
from sklearn.preprocessing import LabelEncoder

In [65]:
label_encoder = LabelEncoder()
SERVICES["tipo_esercizio_encoded"] = label_encoder.fit_transform(SERVICES["Tipo esercizio pa"])
SERVICES = SERVICES.drop(columns=["Tipo esercizio pa"])

# Display the first few rows to confirm changes
SERVICES.tipo_esercizio_encoded.unique()

array([  6,   0,   1,   4,   5,   7,   2,   8,   3,   9,  10,  11,  12,
        13,  14, 102,  15,  16,  17,  18,  19,  20,  21,  22,  25,  23,
        24,  26,  29,  27,  28,  30,  40,  33,  34,  35,  36,  37,  38,
        39,  31,  32,  41,  42,  43,  44,  45,  46,  47,  48,  49,  57,
        58,  52,  53,  54,  55,  56,  50,  51,  59,  66,  62,  63,  64,
        65,  60,  61,  67,  68,  89,  91,  90,  92,  93,  79,  80,  81,
        82,  83,  84,  85,  86,  87,  88,  69,  71,  70,  72,  73,  78,
        74,  75,  77,  76,  94,  95,  96,  97,  98,  99, 100, 101])

In [68]:
SERVICES = pd.get_dummies(SERVICES, columns=["Tipo via"], prefix="tipo_via", drop_first=True)
SERVICES = pd.get_dummies(SERVICES, columns=["ZD"], prefix="zd", drop_first=True)

## Null values
The rows that has a null values for the column "Tipo esercizio pa" have been dropped. <br>
For the null values in "Superficie lavorativa" and "Superficie altri usi" the null values have been filled with the median of the values.

In [None]:
SERVICES = SERVICES.dropna(subset=["Tipo esercizio pa"])

In [66]:
# Step 2: Replace missing values in "superficie lavorativa" with the median
median_superficie = SERVICES["Superficie lavorativa"].median(skipna=True)
median_superficie_altri_usi = SERVICES["Superficie altri usi"].median(skipna=True)

SERVICES["Superficie lavorativa"] = SERVICES["Superficie lavorativa"].fillna(median_superficie)
SERVICES["Superficie altri usi"] = SERVICES["Superficie altri usi"].fillna(median_superficie)

In [69]:
SERVICES.head()

Unnamed: 0,Superficie altri usi,Superficie lavorativa,tipo_esercizio_encoded,tipo_via_BST,tipo_via_COMO,tipo_via_CSO,tipo_via_FOR,tipo_via_GLL,tipo_via_LGO,tipo_via_PAS,...,tipo_via_VLO,zd_2,zd_3,zd_4,zd_5,zd_6,zd_7,zd_8,zd_9,zd_ACCONCIATORE
31,34.0,68.0,6,False,False,True,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
32,34.0,34.0,6,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
33,195.0,34.0,0,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
34,34.0,34.0,6,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
35,34.0,25.0,6,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


## Model
We have used a Random Forest Classification. <br>
The dataset has been splitted in train and test (20%).

In [73]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

X = SERVICES.drop(columns=['tipo_esercizio_encoded'])  # Drop the target column
y = SERVICES['tipo_esercizio_encoded']  # Target is the encoded 'tipo esercizio'

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifier (RandomForest in this case)
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
#print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.1958762886597938
