# 0: Starting the Project 🏴 

In this stage, the initial dataset and all necessary tools and libraries are loaded for the project.

In addition, a general inspection of the dataset is done to get an "idea", "a feel", "an understanding" of the data.

## 0.1: Importing Process 🔃 

Necessary libraries and files are loaded, and configurations are set.

Although some prefer to import each library near where it is used, I like to load every library at the beginning.

> Seeing all the imported libraries at the start gives a general idea of the type of work expected.

In [1]:
# Library Imports
import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.spatial import distance

# Configurations
mpl.rcParams['figure.dpi'] = 600
pd.options.mode.chained_assignment = None  # default='warn'

## 0.2: General Data Inspection 🔍

This is where I conduct a general inspection of the data to understand it, and answer questions such as:

1. What type of dataset do I have?
2. Does it feel complete?
3. What are the data features I am looking at?
4. Right off the bat, can I draw any obvious conclusions?
5. What tools will I need? (Cloud tools, because the data is too big, etc.)

> This stage is simply **reconnaissance**.
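
For question 5, one rough way to judge whether the data fits comfortably in memory is to measure the frame's footprint. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the project dataset) and `DataFrame.memory_usage`, whose `deep=True` option also counts the Python strings held in _object_ columns.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    'name': pd.Series(['Le 147', 'Au Bout du Pont', 'Le Saint Jouvent'], dtype=object),
    'latitude': [45.96, 46.64, 45.96],
})

# Shallow count: object columns are measured as pointers only
shallow_bytes = toy.memory_usage(deep=False).sum()
# Deep count: also includes the string objects themselves
deep_bytes = toy.memory_usage(deep=True).sum()

print(f'shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes')
```

If the deep figure approaches the available RAM, that is a hint that chunked reading or a cloud tool may be needed.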

In [2]:
# Read the dataframe into a variable
df = pd.read_csv('assets/dataset.csv')

# What are the data features/columns and associated data types, no of entries, memory used?
print(df.info())

# Let me look at the first few entries of the data.
display(df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083397 entries, 0 to 1083396
Data columns (total 42 columns):
 #   Column                             Non-Null Count    Dtype  
---  ------                             --------------    -----  
 0   restaurant_link                    1083397 non-null  object 
 1   restaurant_name                    1083397 non-null  object 
 2   original_location                  1083397 non-null  object 
 3   country                            1083397 non-null  object 
 4   region                             1033074 non-null  object 
 5   province                           742765 non-null   object 
 6   city                               682712 non-null   object 
 7   address                            1083397 non-null  object 
 8   latitude                           1067607 non-null  float64
 9   longitude                          1067607 non-null  float64
 10  claimed                            1081555 non-null  object 
 11  awards                  

| | restaurant_link | restaurant_name | original_location | country | region | province | city | address | latitude | longitude | ... | excellent | very_good | average | poor | terrible | food | service | value | atmosphere | keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | g10001637-d10002227 | Le 147 | ["Europe", "France", "Nouvelle-Aquitaine", "Ha... | France | Nouvelle-Aquitaine | Haute-Vienne | Saint-Jouvent | 10 Maison Neuve, 87510 Saint-Jouvent France | 45.961674 | 1.169131 | ... | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 4.5 | 4.0 | NaN | NaN |
| 1 | g10001637-d14975787 | Le Saint Jouvent | ["Europe", "France", "Nouvelle-Aquitaine", "Ha... | France | Nouvelle-Aquitaine | Haute-Vienne | Saint-Jouvent | 16 Place de l Eglise, 87510 Saint-Jouvent France | 45.95704 | 1.20548 | ... | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | g10002858-d4586832 | Au Bout du Pont | ["Europe", "France", "Centre-Val de Loire", "B... | France | Centre-Val de Loire | Berry | Rivarennes | 2 rue des Dames, 36800 Rivarennes France | 46.635895 | 1.386133 | ... | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | g10002986-d3510044 | Le Relais de Naiade | ["Europe", "France", "Nouvelle-Aquitaine", "Co... | France | Nouvelle-Aquitaine | Correze | Lacelle | 9 avenue Porte de la Correze 19170, 19170 Lace... | 45.64261 | 1.82446 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5 | 4.5 | 4.5 | NaN | NaN |
| 4 | g10022428-d9767191 | Relais Du MontSeigne | ["Europe", "France", "Occitanie", "Aveyron", "... | France | Occitanie | Aveyron | Saint-Laurent-de-Levezou | route du Montseigne, 12620 Saint-Laurent-de-Le... | 44.20886 | 2.96047 | ... | 4.0 | 7.0 | 0.0 | 0.0 | 0.0 | 4.5 | 4.5 | 4.5 | NaN | NaN |


In [3]:
# Are there any missing values? How many?
def CheckTotalMissingValueCount(df=df):
    # Count the NaN cells across the whole frame
    total_count = df.isna().sum().sum()
    # df.size is the total number of cells (rows * columns)
    missing_data_ratio = total_count / df.size
    return total_count, missing_data_ratio

# Call the function once and reuse both results
total_missing, missing_ratio = CheckTotalMissingValueCount()
print(total_missing, ' data cells are missing')
print(missing_ratio*100, ' percent of all data cells is missing')

11154479  data cells are missing
24.51389779862168  percent of all data cells is missing
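
The total hides which columns carry the gaps. A per-column breakdown can be sketched as follows, using a small made-up frame (`toy` is hypothetical) in place of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    'city': ['Lacelle', np.nan, 'Rivarennes', np.nan],
    'food': [4.5, np.nan, 4.0, 4.5],
    'country': ['France', 'France', 'France', 'France'],
})

# Fraction of missing cells per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, the same two calls would rank the 42 columns by how incomplete they are.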


## 0.3: Closing Remarks 📜

**General Observation**

A few remarks could be made about this dataset:

- There are 42 total columns (or data features)
- 17 columns are float64, which can be loosely understood as numeric
- 25 columns are object, which could mean two things:
    - they are string objects
    - they are of a mixed data type
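
These counts can be read directly from the frame instead of tallying the `df.info()` listing by hand. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    'name': pd.Series(['Le 147', 'Au Bout du Pont'], dtype=object),
    'city': pd.Series(['Lacelle', np.nan], dtype=object),
    'latitude': [45.64, 46.64],
})

# Number of columns per dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```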


----

# 1.0: Data Cleaning 🧹

In this section, I demonstrate, through work, my ability to perform data cleaning and pre-processing.

## 1.1: Identify the columns with mixed data types

From the previous section, we identified the datatypes of the columns:

- 17 are _float64_ and the remaining 25 are of _object_ datatype.

Depending on the program used to analyze the data, the datatypes available to represent our dataset may differ. For example, _R_ uses the _character_ datatype for text columns that _pandas_ stores as _object_.

In Python, however, we shall check the _object_ datatype more closely:

1. If a column holds ONLY _string_ values, we do not consider it to be of mixed datatype.
2. If it holds more than one data type (e.g. _float_ and _string_), we consider it to be of **mixed data type**.
3. Furthermore, if it is of **mixed data type**, the non-string values could simply be *NaN* values, which are missing data.


>**Hypothesis**: an _object_ datatype column contains either only _string_ values, in which case it is **NOT** of **mixed data type**, or _string_ values together with _NaN_, in which case it is considered **mixed data type**.
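
The reason a missing value shows up as a second type is that `NaN` is a Python `float`, so an _object_ column with gaps holds both `str` and `float` entries. A small illustration on made-up values (`col` is a hypothetical column):

```python
import numpy as np
import pandas as pd

# An object column holding one string and one missing value
col = pd.Series(['Lacelle', np.nan], dtype=object)

print(col.dtype)               # object
print([type(v) for v in col])  # the string is str, the NaN is float
print(type(np.nan) is float)   # True
```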

**Strategy**
1. Implement a function to produce an array with the different datatypes for each column
2. Separate out the columns that are of _object_ datatype
3. If the length of the array is greater than 1, then it is of mixed datatype


Further inspection of the *object* datatype answers the **mixed data type** problem and confirms or rejects the hypothesis.
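
As a compact illustration of this strategy, the types in a column can also be collected with `Series.map(type)`. This is an alternative sketch on a made-up frame (`toy`, `types_per_column` and `mixed` are hypothetical names), not the implementation used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    'country': pd.Series(['France', 'France', 'France'], dtype=object),
    'city': pd.Series(['Lacelle', np.nan, 'Rivarennes'], dtype=object),
    'latitude': [45.64, 46.64, 45.96],
})

# Step 1: set of Python types found in each column
types_per_column = {name: set(toy[name].map(type)) for name in toy.columns}

# Steps 2 and 3: keep object columns that hold more than one type
mixed = [
    name for name, types in types_per_column.items()
    if toy[name].dtype == object and len(types) > 1
]
print(mixed)  # ['city']
```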

In [4]:
def GetAllColumnsDataTypes(df=df):
    '''Get the Python types present in each column of a given dataset'''

    def GetEachColDataTypes(col_series: pd.Series):
        # By default, sort_values() places NaN (a float) after all other values,
        # so a change of type between neighbouring items marks a new type
        col_series = col_series.sort_values()
        dtypes = []
        for item in col_series:
            # Record the type of the first item and every later change of type
            if not dtypes or dtypes[-1] is not type(item):
                dtypes.append(type(item))

        return dtypes

    cols_dtypes = []

    for col_name in df.columns:
        cols_dtypes.append({ 
            'col_name':col_name,
            'dtypes':GetEachColDataTypes(df[col_name])
            })
    
    return cols_dtypes

all_cols_dtypes = GetAllColumnsDataTypes()

# Run the function above to check what data types are present in the columns
for col in all_cols_dtypes:
    print(col)

{'col_name': 'restaurant_link', 'dtypes': [<class 'str'>]}
{'col_name': 'restaurant_name', 'dtypes': [<class 'str'>]}
{'col_name': 'original_location', 'dtypes': [<class 'str'>]}
{'col_name': 'country', 'dtypes': [<class 'str'>]}
{'col_name': 'region', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'province', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'city', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'address', 'dtypes': [<class 'str'>]}
{'col_name': 'latitude', 'dtypes': [<class 'float'>]}
{'col_name': 'longitude', 'dtypes': [<class 'float'>]}
{'col_name': 'claimed', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'awards', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'popularity_detailed', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'popularity_generic', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'top_tags', 'dtypes': [<class 'str'>, <class 'float'>]}
{'col_name': 'price_level', 'dtypes': [<class 

In [5]:
def GetMixedDataTypeColumns(all_cols_dtypes = all_cols_dtypes):
    '''Separate all columns that have mixed data types present in them'''
    mixed_dtype_cols = []
    for col_dtypes in all_cols_dtypes:
        if len(col_dtypes['dtypes']) > 1:
            mixed_dtype_cols.append(col_dtypes)
    return mixed_dtype_cols

# Run the function above to filter and retrieve only the columns with mixed datatypes
mixed_dtype_cols = GetMixedDataTypeColumns()
mixed_cols_names = [name['col_name'] for name in mixed_dtype_cols]

# Printing all columns with mixed datatypes
for index, col in enumerate(mixed_dtype_cols, start=1):
    print(index,'\t', col)

1 	 {'col_name': 'region', 'dtypes': [<class 'str'>, <class 'float'>]}
2 	 {'col_name': 'province', 'dtypes': [<class 'str'>, <class 'float'>]}
3 	 {'col_name': 'city', 'dtypes': [<class 'str'>, <class 'float'>]}
4 	 {'col_name': 'claimed', 'dtypes': [<class 'str'>, <class 'float'>]}
5 	 {'col_name': 'awards', 'dtypes': [<class 'str'>, <class 'float'>]}
6 	 {'col_name': 'popularity_detailed', 'dtypes': [<class 'str'>, <class 'float'>]}
7 	 {'col_name': 'popularity_generic', 'dtypes': [<class 'str'>, <class 'float'>]}
8 	 {'col_name': 'top_tags', 'dtypes': [<class 'str'>, <class 'float'>]}
9 	 {'col_name': 'price_level', 'dtypes': [<class 'str'>, <class 'float'>]}
10 	 {'col_name': 'price_range', 'dtypes': [<class 'str'>, <class 'float'>]}
11 	 {'col_name': 'meals', 'dtypes': [<class 'str'>, <class 'float'>]}
12 	 {'col_name': 'cuisines', 'dtypes': [<class 'str'>, <class 'float'>]}
13 	 {'col_name': 'special_diets', 'dtypes': [<class 'str'>, <class 'float'>]}
14 	 {'col_name': 'features

### ✅ **Results and Conclusion**

We have identified all the columns (17 of them) with mixed data types:

1. region
2. province
3. city
4. claimed
5. awards
6. popularity_detailed
7. popularity_generic
8. top_tags
9. price_level
10. price_range
11. meals
12. cuisines
13. special_diets
14. features
15. original_open_hours
16. default_language
17. keywords

*['region', 'province', 'city', 'claimed', 'awards', 'popularity_detailed', 'popularity_generic', 'top_tags', 'price_level', 'price_range', 'meals', 'cuisines', 'special_diets', 'features', 'original_open_hours', 'default_language', 'keywords']*
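
Seeing `float` next to `str` does not by itself prove that the floats are missing values. A possible follow-up check for the hypothesis, sketched on a made-up frame (`toy` and `non_string_values_are_all_nan` are hypothetical names), is to confirm that every non-string entry in a mixed column is `NaN`:

```python
import numpy as np
import pandas as pd

def non_string_values_are_all_nan(col: pd.Series) -> bool:
    """Return True if every entry that is not a str is a missing value."""
    is_string = col.map(lambda v: isinstance(v, str)).astype(bool)
    return bool(col[~is_string].isna().all())

# Hypothetical stand-in: 'city' fits the hypothesis, 'odd' holds a real float
toy = pd.DataFrame({
    'city': pd.Series(['Lacelle', np.nan, 'Rivarennes'], dtype=object),
    'odd': pd.Series(['a', 1.5, np.nan], dtype=object),
})

for name in toy.columns:
    print(name, non_string_values_are_all_nan(toy[name]))
# city True
# odd False
```

Running the same helper over the 17 columns listed above would show whether the hypothesis holds for each of them.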