# Import the repository from GitHub

First, we clone the repository stored in our GitHub project.


In [1]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_1 = user_secrets.get_secret("NEW_GITHUB_TOKEN")

In [2]:
token = UserSecretsClient().get_secret("NEW_GITHUB_TOKEN")
! git clone https://{token}@github.com/madratak/DIQ_Project2024.git

Cloning into 'DIQ_Project2024'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 80 (delta 27), reused 13 (delta 2), pack-reused 0 (from 0)
Unpacking objects: 100% (80/80), 692.91 KiB | 5.73 MiB/s, done.


In [3]:
%cd /kaggle/working/DIQ_Project2024

/kaggle/working/DIQ_Project2024


# Set up the dataset

Next, we import the libraries we need and load the data into the notebook.

In [4]:
import pandas as pd
import numpy as np
from datetime import datetime
import os

In [50]:
SERVICES = pd.read_csv('/kaggle/working/DIQ_Project2024/data/raw/Comune-di-Milano-Servizi-alla-persona-parrucchieri-estetisti.csv',sep=';',encoding='unicode_escape')
SERVICES.head()

Unnamed: 0,Tipo esercizio pa,Ubicazione,Tipo via,Via,Civico,Codice via,ZD,Prevalente,Superficie altri usi,Superficie lavorativa
0,,LGO DEI GELSOMINI N. 10 (z.d. 6),LGO,DEI GELSOMINI,10,5394.0,6,,,55.0
1,,PZA FIDIA N. 3 (z.d. 9),PZA,FIDIA,3,1144.0,9,CENTRO MASSAGGI RILASSANTI NON ESTETICI,2.0,28.0
2,,VIA ADIGE N. 10 (z.d. 5),VIA,ADIGE,10,4216.0,5,CENTRO BENESSERE,2.0,27.0
3,,VIA BARACCHINI FLAVIO N. 9 (z.d. 1),VIA,BARACCHINI FLAVIO,9,356.0,1,TRUCCO SEMIPERMANENTE,,
4,,VIA BERGAMO N. 12 (z.d. 4),VIA,BERGAMO,12,3189.0,4,,,50.0


# Data inspection

In this section we collect the operations aimed at understanding the content of our dataset:
the total number of records, the data type of each column, the number of unique values for each attribute, etc.

In [51]:
print("\nThe shape of our dataset (rows, columns) is the following:\n", SERVICES.shape)

print("\nThe data types of the different columns are the following:\n", SERVICES.dtypes)


The shape of our dataset (rows, columns) is the following:
 (3909, 10)

The data types of the different columns are the following:
 Tipo esercizio pa         object
Ubicazione                object
Tipo via                  object
Via                       object
Civico                    object
Codice via               float64
ZD                        object
Prevalente                object
Superficie altri usi     float64
Superficie lavorativa    float64
dtype: object


Next, we check whether the dataset contains rows that are exact copies of earlier ones. Once these duplicates are identified, we can drop them.

In [52]:
duplicated = SERVICES[SERVICES.duplicated()]
print(duplicated)

#Now we can eliminate these duplicated rows to start cleaning our dataset
D_SERVICES = SERVICES.drop_duplicates()
    

   Tipo esercizio pa                   Ubicazione Tipo via        Via Civico  \
88      Acconciatore  VIA CORREGGIO N. 8 (z.d. 7)      VIA  CORREGGIO      8   

    Codice via ZD    Prevalente  Superficie altri usi  Superficie lavorativa  
88      6287.0  7  ACCONCIATORE                   NaN                    NaN  
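
By default `duplicated()` flags only the second and later copies of a row, which is why a single row (index 88) is printed above. The sketch below uses made-up toy data (the frame `toy` and its values are hypothetical, not taken from the dataset) to show how `keep=False` lists every copy, which can help when inspecting duplicates before dropping them.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the behaviour
toy = pd.DataFrame({
    "Via": ["CORREGGIO", "ADIGE", "CORREGGIO"],
    "Civico": ["8", "10", "8"],
})

# Default (keep='first'): only the later copies are flagged
later_copies = toy[toy.duplicated()]

# keep=False: every row that has at least one identical twin is flagged
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(later_copies)
print(all_copies)
print(len(toy), "->", len(deduplicated))
```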


Now that we have eliminated the duplicated rows we can start truly inspecting the dataset.

First of all let's check the ***COMPLETENESS***.

Completeness is defined as the number of non-null cells divided by the total number of cells.
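
As a small illustration of this definition, the sketch below computes completeness on a made-up frame (the name `toy` and its values are hypothetical). It also shows the per-column variant, which can be useful to see which attributes pull the overall figure down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 3 missing cells out of 9
toy = pd.DataFrame({
    "Via": ["ADIGE", "BERGAMO", None],
    "ZD": ["5", "4", "9"],
    "Superficie lavorativa": [27.0, np.nan, np.nan],
})

# Overall completeness: non-null cells / all cells
overall = toy.notna().to_numpy().mean()

# Per-column completeness: share of non-null cells in each column
per_column = toy.notna().mean()

print(overall)
print(per_column)
```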

In [53]:
#Here we visualize the table with all the null-value cells
D_SERVICES.isnull()

Unnamed: 0,Tipo esercizio pa,Ubicazione,Tipo via,Via,Civico,Codice via,ZD,Prevalente,Superficie altri usi,Superficie lavorativa
0,True,False,False,False,False,False,False,True,True,False
1,True,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,True,True
4,True,False,False,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...
3904,False,False,False,False,False,False,False,True,True,True
3905,False,False,False,False,False,False,False,True,True,False
3906,False,False,False,False,False,False,False,True,True,True
3907,False,False,False,False,True,False,False,True,True,True


In [54]:
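#display the number of not null values for each column
#Note: these figures are computed on SERVICES, i.e. the dataset before the duplicated row was dropped
print("The total number of not null values for each column is:\n", SERVICES.count())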

#total number of not null values
not_null_values = SERVICES.count().sum()
#number of null values for each column
null_values = SERVICES.isnull().sum()
#total number of null values
total_number_of_null = null_values.sum()
#total number of cells
n_cells = SERVICES.shape[0] * SERVICES.shape[1]

completeness = (n_cells - total_number_of_null) / n_cells
# completeness = not_null_values / n_cells
print("\nTHE COMPLETENESS OF OUR DATASET IS: ", completeness)

The total number of not null values for each column is:
 Tipo esercizio pa        3878
Ubicazione               3909
Tipo via                 3908
Via                      3908
Civico                   3832
Codice via               3908
ZD                       3908
Prevalente                294
Superficie altri usi      745
Superficie lavorativa    2601
dtype: int64

THE COMPLETENESS OF OUR DATASET IS:  0.7902532617037605


Our dataset **does not need** an analysis of ***TIMELINESS***; however, we can still evaluate the *CONSISTENCY* and the *ACCURACY* of our data.



Let's now check the *CONSISTENCY*

Consistency checks that the data follows a set of user-defined rules. Considering the context of our dataset, some rules that we can try to enforce on it are:
- Values of the ZD column must be in the range 1-9, since those are the "municipi" of the city of Milan;
- Values in both "superficie lavorativa" and "superficie altri usi" must be non-negative.

In [56]:
#Let's start by checking that all values in ZD fall in the range 1-9

    #First of all we define the list of allowed values (as strings, since the column is of type 'object')
allowed_values = {'1', '2', '3', '4', '5', '6', '7', '8', '9'}

    #Then we check if all values in the column are in the allowed set
all_valid = D_SERVICES['ZD'].isin(allowed_values).all()

    #Print the result
if all_valid:
    print("All values in the column are valid.")
else:
    print("There are invalid values in the column.")

    #We also print the rows that break this rule to visualize the errors
invalid_values = D_SERVICES[~D_SERVICES['ZD'].isin(allowed_values)]
print("Invalid rows:\n", invalid_values)



There are invalid values in the column.
Invalid rows:
    Tipo esercizio pa                            Ubicazione Tipo via  Via  \
32      Acconciatore  CSO COMO N. 15 interno club f. conti      NaN  NaN   
33          (z.d. 9)                                   CSO     COMO   15   

   Civico  Codice via            ZD Prevalente  Superficie altri usi  \
32    NaN         NaN           NaN        NaN                   NaN   
33   1111         9.0  ACCONCIATORE        NaN                 195.0   

    Superficie lavorativa  
32                    NaN  
33                    NaN  
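
Note that the check above treats a missing `ZD` the same way as a wrong one, and that rows 32 and 33 appear to be a single record split over two lines, so their values are shifted across columns. A minimal sketch, on hypothetical toy data, of how missing and out-of-domain values could be reported separately (`allowed_values` is the name used above, while `toy`, `is_missing` and `is_out_of_domain` are illustrative):

```python
import numpy as np
import pandas as pd

# Same domain as above: the nine municipi, as strings
allowed_values = {str(i) for i in range(1, 10)}

# Hypothetical toy column with one missing and two out-of-domain values
toy = pd.DataFrame({"ZD": ["6", "9", np.nan, "ACCONCIATORE", "10"]})

# Missing values are a completeness problem, not a consistency one
is_missing = toy["ZD"].isna()

# Out-of-domain: present, but not one of the allowed values
is_out_of_domain = ~is_missing & ~toy["ZD"].isin(allowed_values)

missing_rows = toy[is_missing]
invalid_rows = toy[is_out_of_domain]

print("Missing ZD:\n", missing_rows)
print("Out-of-domain ZD:\n", invalid_rows)
```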


In [57]:
#Now we check that the street names are names of streets that actually exist in Milano
#In order to do this we downloaded a dataset from the Comune di Milano page that contains the whole list, updated to the year 2024

    #The data comes in two different files: one that has the street name and a code for the street type, and another that maps the street-type code
    #to the actual abbreviation (Vle, Cso, Pza, etc.)
    #So the first thing to do is to correctly merge these two datasets

TIPOVIA = pd.read_csv('/kaggle/working/DIQ_Project2024/data/external/TIPOVIA.csv',sep=';',encoding='unicode_escape')
STRADARIO = pd.read_csv('/kaggle/working/DIQ_Project2024/data/external/VIARIO_20241104.csv',sep=';',encoding='unicode_escape')

    #Now we merge the two datasets into a single one
MERGED_STRADARIO = pd.merge(STRADARIO, TIPOVIA, left_on='TIPO', right_on='CODICE')
    #Only a few of these attributes are of use for us so we can drop the others
    #(.copy() gives an independent frame, so later column assignments do not act on a slice)
MERGED_STRADARIO = MERGED_STRADARIO[['DESC_ABBREVIATA', 'DESCRITTIVO']].copy()

MERGED_STRADARIO.head()

Unnamed: 0,DESC_ABBREVIATA,DESCRITTIVO
0,Pza,DEL DUOMO
1,Gll,VITTORIO EMANUELE II
2,Via,PIETRO CALDERON DE LA BARCA
3,Lgo,ARTURO TOSCANINI
4,Pza,DEL LIBERTY
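
`pd.merge` performs an inner join by default, so a street whose `TIPO` code had no match in `TIPOVIA` would silently disappear from the reference list. The sketch below shows one possible way to check for this, using a left join with `indicator=True`. The two miniature frames and their values are hypothetical; only the column names come from the cell above.

```python
import pandas as pd

# Hypothetical miniature versions of the two reference files
stradario = pd.DataFrame({
    "TIPO": [1, 2, 99],
    "DESCRITTIVO": ["DEL DUOMO", "ARTURO TOSCANINI", "SENZA TIPO"],
})
tipovia = pd.DataFrame({
    "CODICE": [1, 2],
    "DESC_ABBREVIATA": ["Pza", "Lgo"],
})

# A left join keeps every street; the _merge column records where each row was found
checked = pd.merge(stradario, tipovia, left_on="TIPO", right_on="CODICE",
                   how="left", indicator=True)

# Streets whose type code has no counterpart in the type table
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched[["TIPO", "DESCRITTIVO"]])
```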


In [58]:
#We need to convert everything to uppercase since the .isin() function is case-sensitive
MERGED_STRADARIO['DESC_ABBREVIATA'] = MERGED_STRADARIO['DESC_ABBREVIATA'].str.upper()
MERGED_STRADARIO.head()

Unnamed: 0,DESC_ABBREVIATA,DESCRITTIVO
0,PZA,DEL DUOMO
1,GLL,VITTORIO EMANUELE II
2,VIA,PIETRO CALDERON DE LA BARCA
3,LGO,ARTURO TOSCANINI
4,PZA,DEL LIBERTY


In [59]:
#Now that we have our list of streets we can use it as reference to check whether or not our analyzed dataset contains errors

#We specify for both fields the column to check and the reference one
street_column_to_check = 'Via'
street_name_reference = 'DESCRITTIVO'

type_column_to_check = 'Tipo via'
type_reference = 'DESC_ABBREVIATA'

#Check if the values in each column belong to the respective reference column
is_street_valid = D_SERVICES[street_column_to_check].isin(MERGED_STRADARIO[street_name_reference])
is_type_valid = D_SERVICES[type_column_to_check].isin(MERGED_STRADARIO[type_reference])

#We combine the results
all_valid = is_street_valid & is_type_valid

#Check overall validity
if all_valid.all():
    print("All values in both columns are valid.")
else:
    print("Some values in one or both columns are not valid.")
    
    #If we encountered errors, we show in which rows they have been detected
    invalid_rows = D_SERVICES[~all_valid]
    print("Invalid rows:\n", invalid_rows)

    # Optional: Save invalid rows to a file
    invalid_rows.to_csv('invalid_rows.csv', index=False)

#Optional: Check invalid rows per column
invalid_streets = D_SERVICES[~is_street_valid]
invalid_types = D_SERVICES[~is_type_valid]

print("Invalid STREETS:\n", invalid_streets[['Via']])
print("Invalid TYPE:\n", invalid_types[['Tipo via']])

Some values in one or both columns are not valid.
Invalid rows:
                                       Tipo esercizio pa  \
3                                                   NaN   
5                                                   NaN   
6                                                   NaN   
7                                                   NaN   
8                                                   NaN   
...                                                 ...   
3896                     TIPO C TRATT.ESTETICI DIMAGRIM   
3897                     TIPO C TRATT.ESTETICI DIMAGRIM   
3901  TIPO C TRATT.ESTETICI DIMAGRIM;TIPO B CENTRO D...   
3903                     TIPO D ESTET.APPAR.ELETTROMECC   
3904  TIPO D ESTET.APPAR.ELETTROMECC;TIPO C TRATT.ES...   

                                             Ubicazione Tipo via  \
3                   VIA BARACCHINI FLAVIO N. 9 (z.d. 1)      VIA   
5                   VIA BOTTEGO VITTORIO N. 13 (z.d. 2)      VIA   
6                 VIA 
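
The first invalid rows listed above carry names such as "BARACCHINI FLAVIO" and "BOTTEGO VITTORIO", while the reference sample shows "PIETRO CALDERON DE LA BARCA" with the given name first. If part of the mismatch is due to word order (an assumption based only on these samples), an order-insensitive comparison might reduce false alarms. A minimal sketch on hypothetical toy data, where `name_key` is an illustrative helper and not part of the notebook:

```python
import pandas as pd

def name_key(name):
    """Return an order-insensitive key: the sorted, upper-cased words of a name."""
    if pd.isna(name):
        return None
    return " ".join(sorted(str(name).upper().split()))

# Hypothetical toy data, only to illustrate the comparison
services = pd.DataFrame({"Via": ["BARACCHINI FLAVIO", "ADIGE", "INESISTENTE"]})
reference = pd.DataFrame({"DESCRITTIVO": ["FLAVIO BARACCHINI", "ADIGE"]})

# Build the set of reference keys once, then test each street against it
reference_keys = set(reference["DESCRITTIVO"].map(name_key))
is_street_valid = services["Via"].map(name_key).isin(reference_keys)

print(services[~is_street_valid])
```

This comparison is deliberately looser than `.isin()` on the raw strings: two different streets made of the same words in a different order would be treated as equal.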

In [61]:
#And now we check that all values in both "superficie lavorativa" and "superficie altri usi" are non-negative
    #List of columns to check
columns_to_check = ['Superficie altri usi', 'Superficie lavorativa']  

    #Check that every value is non-negative
    #Note: comparisons with NaN evaluate to False, so missing values make this check fail as well
all_positive = (D_SERVICES[columns_to_check] >= 0).all().all()

    #Print the result
if all_positive:
    print("All values in the specified columns are positive.")
else:
    print("There are negative or zero values in the specified columns.")

# Ensure the columns contain only numeric data
negative_rows = D_SERVICES.loc[
    (D_SERVICES[columns_to_check].select_dtypes(include=[float, int]) < 0).any(axis=1)
]

# Display rows with negative values
print("Rows with negative values:\n", negative_rows)


There are negative or zero values in the specified columns.
Rows with negative values:
 Empty DataFrame
Columns: [Tipo esercizio pa, Ubicazione, Tipo via, Via, Civico, Codice via, ZD, Prevalente, Superficie altri usi, Superficie lavorativa]
Index: []
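
The two results above look contradictory: the first check fails, yet no row has a negative value. A likely explanation is that in pandas any comparison with `NaN` evaluates to `False`, so `>= 0` fails on every missing cell. The sketch below, on hypothetical toy data, contrasts the plain check with one that only flags values strictly below zero (`naive_check` and `nan_aware_check` are illustrative names):

```python
import numpy as np
import pandas as pd

columns_to_check = ["Superficie altri usi", "Superficie lavorativa"]

# Hypothetical toy data: some missing values, no negative ones
toy = pd.DataFrame({
    "Superficie altri usi": [2.0, np.nan, 0.0],
    "Superficie lavorativa": [28.0, 55.0, np.nan],
})

# NaN >= 0 is False, so the plain check fails although nothing is negative
naive_check = (toy[columns_to_check] >= 0).all().all()

# Flag only values strictly below 0; NaN < 0 is False, so missing cells are ignored
is_negative = toy[columns_to_check] < 0
nan_aware_check = not is_negative.any().any()

negative_rows = toy[is_negative.any(axis=1)]

print(naive_check, nan_aware_check)
print(negative_rows)
```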


Finally we can look at *ACCURACY*

Accuracy comes in two forms: semantic accuracy, which checks whether a value has a real meaning in the real world, and syntactic accuracy, which checks whether a value is syntactically correct, i.e. free of errors.
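
One common way to approximate syntactic accuracy is to compare each value with a domain of admissible values and look for the closest match. The sketch below uses `difflib` from the standard library. The domain, the observed values, the helper `closest_domain_value` and the cutoff of 0.8 are all hypothetical choices made for illustration.

```python
from difflib import get_close_matches

# Hypothetical domain of admissible values and some observed values
domain = ["ACCONCIATORE", "ESTETISTA", "TRUCCATORE"]
observed = ["ACCONCIATORE", "ACONCIATORE", "XYZ"]

def closest_domain_value(value, domain, cutoff=0.8):
    """Return the most similar domain value, or None if nothing is close enough."""
    matches = get_close_matches(value, domain, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Map each observed value to its closest admissible value (None = no candidate)
report = {value: closest_domain_value(value, domain) for value in observed}

for value, match in report.items():
    print(value, "->", match)
```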

While we don't have

# Data profiling

*COLUMN ANALYSIS*

Continuing our inspection of the data, we can analyze the dataset column by column, inspecting the number of unique values for each column and counting the most and least common ones. This kind of analysis will be useful later in our work.

In [67]:
#Using a loop we display for each column in the dataset the number of different values and the list in order of frequency
for column in D_SERVICES.columns:
    print(column, "\n")
    print(f"The column '{column}' has the following number of different values: ", D_SERVICES[column].nunique())
    print(f"And these are the values listed from the most frequent to the least for '{column}':")
    print(D_SERVICES[column].value_counts())
    print("\n" + "-"*50 + "\n")  # Adds a separator between the columns

Tipo esercizio pa 

The column 'Tipo esercizio pa' has the following number of different values:  103
And these are the values listed from the most frequent to the least for 'Tipo esercizio pa':
Tipo esercizio pa
Parrucchiere per signora                           1048
ACCONCIATORE                                        586
Parrucchiere per uomo                               439
TIPO A - REG.2003                                   335
TIPO A - REG.2003;TIPO B CENTRO DI ABBRONZATURA     313
                                                   ... 
TIPO A-B-C-D;Acconciatore                             1
TIPO A-B-C-D;ACCONCIATORE                             1
TIPO A-B-C-D;Estetista in profumeria                  1
TIPO A ESTETICA MANUALE;Acconciatore                  1
Truccatore                                            1
Name: count, Length: 103, dtype: int64

--------------------------------------------------

Ubicazione 

The column 'Ubicazione' has the following number of different valu

# Wrangling

This operation renames the columns so that each attribute has a more fitting description.

In [47]:

SERVICES = SERVICES.rename(columns={
    'Tipo esercizio pa': 'store_type',
    'Ubicazione': 'address',
    'Tipo via': 'street_type',
    'Via': 'street',
    'Civico': 'number',
    'Codice via': 'street_code',
    'ZD': 'zd',
    'Prevalente': 'main_activity',
    'Superficie altri usi': 'secondary_space',
    'Superficie lavorativa': 'main_space'
})
SERVICES.head()

Unnamed: 0,store_type,address,street_type,street,number,street_code,zd,main_activity,secondary_space,main_space
0,,LGO DEI GELSOMINI N. 10 (z.d. 6),LGO,DEI GELSOMINI,10,5394.0,6,,,55.0
1,,PZA FIDIA N. 3 (z.d. 9),PZA,FIDIA,3,1144.0,9,CENTRO MASSAGGI RILASSANTI NON ESTETICI,2.0,28.0
2,,VIA ADIGE N. 10 (z.d. 5),VIA,ADIGE,10,4216.0,5,CENTRO BENESSERE,2.0,27.0
3,,VIA BARACCHINI FLAVIO N. 9 (z.d. 1),VIA,BARACCHINI FLAVIO,9,356.0,1,TRUCCO SEMIPERMANENTE,,
4,,VIA BERGAMO N. 12 (z.d. 4),VIA,BERGAMO,12,3189.0,4,,,50.0
