# 3. Data Analysis
**Group Project for DATA INFORMATION AND QUALITY (2024-2025)** <br>
Analysis of Milan Personal Services - Database 12 <br>
Mauro Orazio Drago, Dennis Pierantozzi, Davide Morelli

The project asks us to choose an analysis to run on both the dirty and the cleaned dataset, in order to assess the quality of the data preparation pipeline developed in the previous steps. <br>

We have decided to focus on a **classification task for the "Tipo esercizio pa" column**, aiming to predict its value based on the other available data.

### Import the repository from GitHub

First, we clone the project repository hosted on GitHub.

In [1]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_1 = user_secrets.get_secret("NEW_GITHUB_TOKEN")

In [2]:
token = UserSecretsClient().get_secret("NEW_GITHUB_TOKEN")
! git clone https://{token}@github.com/madratak/Milan_Services_2012.git

Cloning into 'Milan_Services_2012'...
remote: Enumerating objects: 172, done.
remote: Counting objects: 100% (172/172), done.
remote: Compressing objects: 100% (139/139), done.
remote: Total 172 (delta 76), reused 69 (delta 19), pack-reused 0 (from 0)
Receiving objects: 100% (172/172), 4.95 MiB | 10.11 MiB/s, done.
Resolving deltas: 100% (76/76), done.


## Set up
Below we import the required libraries and load both the dirty and the cleaned dataset.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score

# Set max column width to None to display full content
pd.set_option('display.max_colwidth', None)

import warnings
# Suppress RuntimeWarning messages
warnings.filterwarnings("ignore", category=RuntimeWarning)

In [4]:
%cd Milan_Services_2012

/kaggle/working/Milan_Services_2012


In [5]:
SERVICES = pd.read_csv('data/raw/Comune-di-Milano-Servizi-alla-persona-parrucchieri-estetisti.csv',sep=';',encoding='unicode_escape')
SERVICES.head()

| | Tipo esercizio pa | Ubicazione | Tipo via | Via | Civico | Codice via | ZD | Prevalente | Superficie altri usi | Superficie lavorativa |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | LGO DEI GELSOMINI N. 10 (z.d. 6) | LGO | DEI GELSOMINI | 10 | 5394.0 | 6 | | | 55.0 |
| 1 | | PZA FIDIA N. 3 (z.d. 9) | PZA | FIDIA | 3 | 1144.0 | 9 | CENTRO MASSAGGI RILASSANTI NON ESTETICI | 2.0 | 28.0 |
| 2 | | VIA ADIGE N. 10 (z.d. 5) | VIA | ADIGE | 10 | 4216.0 | 5 | CENTRO BENESSERE | 2.0 | 27.0 |
| 3 | | VIA BARACCHINI FLAVIO N. 9 (z.d. 1) | VIA | BARACCHINI FLAVIO | 9 | 356.0 | 1 | TRUCCO SEMIPERMANENTE | | |
| 4 | | VIA BERGAMO N. 12 (z.d. 4) | VIA | BERGAMO | 12 | 3189.0 | 4 | | | 50.0 |


In [6]:
SERVICES_CLEANED = pd.read_csv('data/cleaned/cleaned-SERVICES.csv')
SERVICES_CLEANED.head()

| | t_es | t_via | via | civ | cod_via | zd | sup_alt | sup_lav |
|---|---|---|---|---|---|---|---|---|
| 0 | Tipo A - Estetica Manuale | Piazza | DEL DUOMO | 17 | 1 | 1 | 8.0 | 74.0 |
| 1 | Tipo A - Estetica Manuale;Tipo B - Centro di Abbronzatura | Corso | GIUSEPPE GARIBALDI | 104 | 1010 | 1 | 26.5 | 48.5 |
| 2 | Tipo A - Estetica Manuale;Parrucchiere per Donna | Corso | GIUSEPPE GARIBALDI | 110 | 1010 | 1 | 37.0 | 88.0 |
| 3 | Acconciatore | Corso | GIUSEPPE GARIBALDI | 39 | 1010 | 1 | 6.0 | 54.0 |
| 4 | Parrucchiere per Donna | Corso | GIUSEPPE GARIBALDI | 46 | 1010 | 1 | 4.0 | 31.0 |


## Dirty Dataset Pipeline

First, we identified the features to use for our task. <br>
Columns we keep (`Tipo esercizio pa` is the target, the others are the predictors):
* `Tipo esercizio pa`
* `Tipo via`
* `ZD`
* `Superficie altri usi`
* `Superficie lavorativa`

In [7]:
SERVICES = SERVICES.drop(columns=["Civico", "Via", "Prevalente", "Ubicazione", "Codice via"])

From our first assessment we know that the dirty dataset contains many null values and categorical columns. <br>
We need to handle the missing values and encode the categorical features to make them suitable for training.

In [8]:
SERVICES.dtypes

Tipo esercizio pa         object
Tipo via                  object
ZD                        object
Superficie altri usi     float64
Superficie lavorativa    float64
dtype: object

In [9]:
null_count = SERVICES.isnull().sum()
print('Number of null values:\n', null_count)

Number of null values:
 Tipo esercizio pa          31
Tipo via                    1
ZD                          1
Superficie altri usi     3164
Superficie lavorativa    1308
dtype: int64


### Handling null values
For `Superficie lavorativa` and `Superficie altri usi` we fill the null values with the median, in line with the approach used in our data cleaning phase.

* `Tipo esercizio pa` null values are dropped
* `Superficie lavorativa` filled with the median of the values
* `Superficie altri usi` filled with the median of the values

In [10]:
# Replace missing surface values with a median
median_superficie = SERVICES["Superficie lavorativa"].median(skipna=True)
median_superficie_altri_usi = SERVICES["Superficie altri usi"].median(skipna=True)

SERVICES["Superficie lavorativa"] = SERVICES["Superficie lavorativa"].fillna(median_superficie)
# Note: this fills `Superficie altri usi` with the median of `Superficie lavorativa`
# (median_superficie); median_superficie_altri_usi is computed above but not used.
SERVICES["Superficie altri usi"] = SERVICES["Superficie altri usi"].fillna(median_superficie)

SERVICES = SERVICES.dropna(subset=["Tipo esercizio pa"])
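
The median imputation described in this section can be sketched on a small toy frame. This is only an illustrative sketch, not the notebook's own cell: the values are made up, and `toy`, `surface_cols` and `medians` are hypothetical names. It shows one way to give each column its own median in a single call.

```python
import pandas as pd

# Toy frame standing in for the two surface columns (values are made up).
toy = pd.DataFrame({
    "Superficie lavorativa": [30.0, None, 50.0, 40.0],
    "Superficie altri usi": [2.0, 4.0, None, None],
})

surface_cols = ["Superficie lavorativa", "Superficie altri usi"]

# DataFrame.median() returns one median per column (NaN values are skipped),
# and fillna() with that Series fills each column with its own median.
medians = toy[surface_cols].median()
toy[surface_cols] = toy[surface_cols].fillna(medians)

print(medians)
print(toy)
```

Passing a Series to `fillna` matches its index against the column names, so no column can accidentally receive another column's median.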

### Encoding
Since `Tipo esercizio pa` has **103** unique values, we use Label Encoding rather than One-Hot Encoding for it. Classes with a single instance are removed first, because the stratified split used later needs at least two samples per class.<br>

* `Tipo esercizio pa`: label encoding with sklearn's `LabelEncoder`
* `Tipo via`: one-hot encoding
* `ZD`: one-hot encoding

In [11]:
# Count occurrences of each class
class_counts = SERVICES["Tipo esercizio pa"].value_counts()

# Filter out classes with only one instance
SERVICES = SERVICES[SERVICES["Tipo esercizio pa"].isin(class_counts[class_counts > 1].index)]

# Re-encode the labels after filtering
label_encoder = LabelEncoder()
SERVICES["tipo_esercizio_encoded"] = label_encoder.fit_transform(SERVICES["Tipo esercizio pa"])
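
To make the filtering and label-encoding step concrete, here is a minimal sketch on a toy label column. The label values are borrowed from the dataset, but the series itself and the names `labels`, `kept`, `encoder` and `codes` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column: "Truccatore" appears only once.
labels = pd.Series([
    "Acconciatore",
    "Truccatore",
    "Acconciatore",
    "Parrucchiere per uomo",
    "Parrucchiere per uomo",
])

# Keep only the classes that occur more than once.
counts = labels.value_counts()
kept = labels[labels.isin(counts[counts > 1].index)]

# LabelEncoder maps the sorted class names to consecutive integers 0..n-1.
encoder = LabelEncoder()
codes = encoder.fit_transform(kept)

print(list(encoder.classes_))
print(list(codes))
```

Fitting the encoder after the filter keeps the integer codes consecutive, and `encoder.inverse_transform` can map predictions back to the original label strings.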

In [12]:
print(class_counts)

Tipo esercizio pa
Parrucchiere per signora                           1048
ACCONCIATORE                                        586
Parrucchiere per uomo                               439
TIPO A - REG.2003                                   335
TIPO A - REG.2003;TIPO B CENTRO DI ABBRONZATURA     313
                                                   ... 
TIPO A-B-C-D;Acconciatore                             1
TIPO A-B-C-D;ACCONCIATORE                             1
TIPO A-B-C-D;Estetista in profumeria                  1
TIPO A ESTETICA MANUALE;Acconciatore                  1
Truccatore                                            1
Name: count, Length: 103, dtype: int64


In [13]:
SERVICES = pd.get_dummies(SERVICES, columns=["Tipo via"], prefix="tipo_via", drop_first=True)
SERVICES = pd.get_dummies(SERVICES, columns=["ZD"], prefix="zd", drop_first=True)
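
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up frame (`toy` and `encoded` are hypothetical names) with the same two categorical columns.

```python
import pandas as pd

# Toy frame with the two categorical columns (values are made up).
toy = pd.DataFrame({
    "Tipo via": ["VIA", "CSO", "PZA", "VIA"],
    "ZD": ["1", "3", "1", "9"],
})

# One indicator column per category, except the first category of each
# column (in sorted order), which is dropped and becomes the implicit baseline.
encoded = pd.get_dummies(
    toy,
    columns=["Tipo via", "ZD"],
    prefix=["tipo_via", "zd"],
    drop_first=True,
)

print(list(encoded.columns))
```

A row whose indicators are all false for a given prefix therefore belongs to the dropped baseline category (`CSO` and `1` in this toy frame).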

In [14]:
SERVICES.head()

| | Tipo esercizio pa | Superficie altri usi | Superficie lavorativa | tipo_esercizio_encoded | tipo_via_BST | tipo_via_CSO | tipo_via_FOR | tipo_via_GLL | tipo_via_LGO | tipo_via_PAS | ... | tipo_via_VLE | tipo_via_VLO | zd_2 | zd_3 | zd_4 | zd_5 | zd_6 | zd_7 | zd_8 | zd_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | Acconciatore | 34.0 | 68.0 | 5 | False | True | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 32 | Acconciatore | 34.0 | 34.0 | 5 | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 34 | Acconciatore | 34.0 | 34.0 | 5 | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 35 | Acconciatore | 34.0 | 25.0 | 5 | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 36 | Acconciatore | 34.0 | 28.0 | 5 | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


### Model Selection
The dataset is split into train (80%) and test (20%) sets. <br>
We train a **Random Forest classifier** from the sklearn library and then an **XGBoost classifier**, both on the same split.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

X = SERVICES.drop(columns=['tipo_esercizio_encoded', 'Tipo esercizio pa'])  # Drop the target column
y = SERVICES['tipo_esercizio_encoded']  # Target is the encoded 'tipo esercizio'

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize the classifier (RandomForest in this case)
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

accuracy_rf_dirty = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

# Extract overall precision, recall, and F1-score from the 'weighted avg' section
precision_rf_dirty = report['weighted avg']['precision']
recall_rf_dirty = report['weighted avg']['recall']
f1_score_rf_dirty = report['weighted avg']['f1-score']

# Print the results
print("Measure results\n")
print(f"Accuracy: {accuracy_rf_dirty:8.4f}")
print(f"Precision: {precision_rf_dirty:8.4f}")
print(f"Recall: {recall_rf_dirty:8.4f}")
print(f"F1 Score: {f1_score_rf_dirty:8.4f}")

Measure results

Accuracy:   0.1966
Precision:   0.1637
Recall:   0.1966
F1 Score:   0.1769
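
The weighted averages reported above can be reproduced on toy labels. The sketch below is illustrative only: the labels are made up, and `zero_division=0` is an assumption that the notebook does not use (it scores classes that are never predicted as 0 without emitting a warning). It also shows why Accuracy and Recall coincide in the results: support-weighted recall is, by construction, equal to accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels: class 2 is never predicted, so its precision is undefined.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])

# Per-class scores are averaged with weights equal to each class's support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

print(f"Accuracy:  {accuracy:8.4f}")
print(f"Precision: {precision:8.4f}")
print(f"Recall:    {recall:8.4f}")
print(f"F1 Score:  {f1:8.4f}")
```

Because weighted recall carries no information beyond accuracy, the macro average (`average="macro"`) may be a useful complement when many classes are rare.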




In [16]:
# Initialize the classifier (XGBoost in this case)
xgb_classifier = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='logloss')

# Train the model
xgb_classifier.fit(X_train, y_train)

# Make predictions
y_pred = xgb_classifier.predict(X_test)

accuracy_xgb_dirty = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

# Extract overall precision, recall, and F1-score from the 'weighted avg' section
precision_xgb_dirty = report['weighted avg']['precision']
recall_xgb_dirty = report['weighted avg']['recall']
f1_score_xgb_dirty = report['weighted avg']['f1-score']

# Print the results
print("Measure results\n")
print(f"Accuracy: {accuracy_xgb_dirty:8.4f}")
print(f"Precision: {precision_xgb_dirty:8.4f}")
print(f"Recall: {recall_xgb_dirty:8.4f}")
print(f"F1 Score: {f1_score_xgb_dirty:8.4f}")

Measure results

Accuracy:   0.2474
Precision:   0.2087
Recall:   0.2474
F1 Score:   0.2111




## Clean Dataset Pipeline
For the cleaned dataset we follow the same steps as for the dirty dataset, except that there are no null values to handle. <br>
Features we have selected:
* `t_es`
* `t_via`
* `zd`
* `sup_alt`
* `sup_lav`

In [17]:
# The cleaned dataset doesn't have null values
null_count = SERVICES_CLEANED.isnull().sum()
print('Number of null values:\n', null_count)

Number of null values:
 t_es       0
t_via      0
via        0
civ        0
cod_via    0
zd         0
sup_alt    0
sup_lav    0
dtype: int64


In [18]:
# Drop the unselected features
SERVICES_CLEANED = SERVICES_CLEANED.drop(columns=["civ", "via", "cod_via"])

### Encoding
We encode `t_es`, `t_via` and `zd` in the same way as `Tipo esercizio pa`, `Tipo via` and `ZD` of the dirty dataset, respectively.

In [19]:
class_counts = SERVICES_CLEANED["t_es"].value_counts()
SERVICES_CLEANED = SERVICES_CLEANED[SERVICES_CLEANED["t_es"].isin(class_counts[class_counts > 1].index)]

In [20]:
label_encoder = LabelEncoder()
SERVICES_CLEANED["t_es_encoded"] = label_encoder.fit_transform(SERVICES_CLEANED["t_es"])

# Display the distinct encoded labels to confirm the encoding
SERVICES_CLEANED.t_es_encoded.unique()

array([21, 26, 25,  0,  8,  1, 14,  2, 19, 11,  5, 29, 17, 31, 12, 15,  7,
       28, 23,  3,  9, 18, 16, 32,  4, 10, 24, 27,  6, 20, 13, 30, 22, 33])

In [21]:
SERVICES_CLEANED = pd.get_dummies(SERVICES_CLEANED, columns=["t_via"], prefix="t_via", drop_first=True)
SERVICES_CLEANED = pd.get_dummies(SERVICES_CLEANED, columns=["zd"], prefix="zd", drop_first=True)

### Model Selection
The dataset is split with the same ratio used for the dirty dataset: 80% training, 20% test. <br>
We train a **Random Forest** as we did for the dirty dataset. <br>

Then we also train an **XGBoost Classifier** to improve the accuracy further.

In [22]:
X = SERVICES_CLEANED.drop(columns=['t_es_encoded', 't_es'])  # Drop the target column
y = SERVICES_CLEANED['t_es_encoded']  # Target is the encoded 'tipo esercizio'


# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
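
As a small illustration of what `stratify=y` does in the split above, the following sketch uses made-up data (the arrays are hypothetical). With an 80/20 class ratio, the 20% test set keeps the same ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples: 80 of class 0 and 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y preserves the class proportions in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_train))  # class counts in the training set
print(np.bincount(y_test))   # class counts in the test set
```

Stratification requires at least two samples per class, which is why classes with a single instance are filtered out before the split.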

In [23]:
# Random Forest Classifier

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

accuracy_rf_clean = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

# Extract overall precision, recall, and F1-score from the 'weighted avg' section
precision_rf_clean = report['weighted avg']['precision']
recall_rf_clean = report['weighted avg']['recall']
f1_score_rf_clean = report['weighted avg']['f1-score']

# Print the results
print("Measure results\n")
print(f"Accuracy: {accuracy_rf_clean:8.4f}")
print(f"Precision: {precision_rf_clean:8.4f}")
print(f"Recall: {recall_rf_clean:8.4f}")
print(f"F1 Score: {f1_score_rf_clean:8.4f}")

Measure results

Accuracy:   0.7579
Precision:   0.7274
Recall:   0.7579
F1 Score:   0.7400




In [24]:
# XGB Classifier

xgb_classifier = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='logloss')

xgb_classifier.fit(X_train, y_train)

# Make predictions
y_pred = xgb_classifier.predict(X_test)

accuracy_xgb_clean = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

# Extract overall precision, recall, and F1-score from the 'weighted avg' section
precision_xgb_clean = report['weighted avg']['precision']
recall_xgb_clean = report['weighted avg']['recall']
f1_score_xgb_clean = report['weighted avg']['f1-score']

# Print the results with aligned numbers
print("Measure results\n")
print(f"Accuracy:  {accuracy_xgb_clean:8.4f}")
print(f"Precision: {precision_xgb_clean:8.4f}")
print(f"Recall:    {recall_xgb_clean:8.4f}")
print(f"F1 Score:  {f1_score_xgb_clean:8.4f}")

Measure results

Accuracy:    0.8154
Precision:   0.8072
Recall:      0.8154
F1 Score:    0.8096




## Conclusions
From our analysis of the classification task, we obtained the following results:


In [25]:
results = {
    "Model-Dataset": [
        "RandomForest-DatasetDirty",
        "RandomForest-DatasetCleaned",
        "XGB-DatasetDirty",
        "XGB-DatasetCleaned"
    ],
    "Accuracy": [accuracy_rf_dirty, accuracy_rf_clean, accuracy_xgb_dirty, accuracy_xgb_clean],
    "Precision": [precision_rf_dirty, precision_rf_clean, precision_xgb_dirty, precision_xgb_clean],
    "Recall": [recall_rf_dirty, recall_rf_clean, recall_xgb_dirty, recall_xgb_clean],
    "F1 Score": [f1_score_rf_dirty, f1_score_rf_clean, f1_score_xgb_dirty, f1_score_xgb_clean]
}

# Create the DataFrame
results_df = pd.DataFrame(results)

# Display the results table
results_df

| | Model-Dataset | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | RandomForest-DatasetDirty | 0.196615 | 0.163688 | 0.196615 | 0.176864 |
| 1 | RandomForest-DatasetCleaned | 0.757943 | 0.727417 | 0.757943 | 0.739996 |
| 2 | XGB-DatasetDirty | 0.247396 | 0.208719 | 0.247396 | 0.211121 |
| 3 | XGB-DatasetCleaned | 0.815431 | 0.807155 | 0.815431 | 0.809613 |


**Our conclusions:**

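* The cleaned dataset substantially improves the performance of both models: accuracy rises from about 0.20 to 0.76 for RandomForest and from about 0.25 to 0.82 for XGBoost, a gain of roughly 0.56 in both cases. Part of this gain likely comes from the consolidated target labels: the dirty column has 103 distinct values, several of which are case or spelling variants of the same category (for example `Acconciatore` and `ACCONCIATORE`), while the cleaned `t_es` column has 34 classes after filtering.
* On both the cleaned and the dirty dataset, XGBoost outperforms RandomForest on all four metrics. This suggests that XGBoost is the better fit for this task, although with a single train/test split the difference should be read with some caution.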
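
The gain attributable to cleaning can be computed directly from the accuracy values in the results table. A minimal sketch, with the reported figures typed in by hand and the hypothetical names `results`, `by_model` and `gain`:

```python
import pandas as pd

# Accuracy values copied from the results table above.
results = pd.DataFrame({
    "model": ["RandomForest", "RandomForest", "XGB", "XGB"],
    "dataset": ["dirty", "cleaned", "dirty", "cleaned"],
    "accuracy": [0.196615, 0.757943, 0.247396, 0.815431],
})

# One row per model, one column per dataset.
by_model = results.pivot(index="model", columns="dataset", values="accuracy")

# Absolute accuracy gain obtained by training on the cleaned dataset.
gain = (by_model["cleaned"] - by_model["dirty"]).round(4)
print(gain)
```

The two gains are almost identical (about 0.561 for RandomForest and 0.568 for XGB), so on these figures the cleaning helps both models to a similar degree.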