<a href="https://colab.research.google.com/github/giakomorssi/Machine_Learning/blob/main/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import the Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

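# Note: the Colab hardware accelerator is selected under
# Runtime > Change runtime type; these environment variables do not
# switch the hardware by themselves.
import os
os.environ['COLAB_TPU_ADDR'] = ''
os.environ['COLAB_GPU_ALLOC'] = '1'
os.environ['COLAB_GPU'] = '1'
print("GPU environment variables set")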

import tensorflow as tf

# Check whether TensorFlow can see a GPU (this does not change the runtime)
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("No GPU found; select a GPU runtime under Runtime > Change runtime type")

pd.set_option('display.max_columns', None)

In [None]:
df = pd.read_csv('/content/drive/MyDrive/University/ML/customer_segmentation.csv')

Discuss the correlations and how the data is distributed, using visualizations where helpful. In particular, try to answer these questions:

*   Looking at the `price` distribution, do you think the dataset is balanced?
*   Looking at the `customer_city` distribution, do you think the dataset is balanced?
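
As a starting point for these two questions, the sketch below summarizes a numeric column with its skewness and a categorical column with its share of rows per value. The `sample` frame and all of its values are made up for illustration; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Synthetic stand-in for the real data (values are invented for illustration).
sample = pd.DataFrame({
    "price": [10.0, 12.5, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 400.0, 950.0],
    "customer_city": ["sao paulo"] * 6 + ["rio de janeiro"] * 3 + ["curitiba"],
})

# Skewness well above 0 indicates a long right tail (a few very expensive items).
price_skew = sample["price"].skew()

# Share of rows per city; a balanced column would have roughly equal shares.
city_share = sample["customer_city"].value_counts(normalize=True)

print(sample["price"].describe())
print(f"price skewness: {price_skew:.2f}")
print(city_share)
```

A histogram of `price` and a bar chart of the largest `city_share` entries would show the same information visually.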

# Meaning of the Columns

* **order_id**: unique order identifier
* **customer_id**: the key to the orders dataset. Each order has a unique customer_id
* **customer_unique_id**: the unique identifier of a customer.
* **customer_city**: customer city name
* **customer_state**: customer state
* **order_item_id**: sequential number identifying the number of items included in the same order.
* **product_id**: product unique identifier
* **price**: item price
* **freight_value**: item freight value (if an order has more than one item, the freight value is split between items)
* **payment_type**: method of payment chosen by the customer.
* **payment_installments**: number of installments chosen by the customer.
* **payment_value**: transaction value.
* **order_status**: the order status (delivered, shipped, etc).
* **order_purchase_timestamp**: purchase timestamp.
* **order_approved_at**: purchase approval timestamp.
* **order_delivered_carrier_date**: order posting timestamp, i.e. when the order was handed over to the logistics partner.
* **order_delivered_customer_date**: actual order delivery date to the customer.
* **order_estimated_delivery_date**: the estimated delivery date informed to the customer at the purchase moment.
* **shipping_limit_date**: the deadline by which the seller must have the order ready to be shipped
* **product_category_name**: root product category, in Portuguese.
* **product_category_name_english**: root category of product, in English
* **product_name_lenght**: number of characters extracted from the product name.
* **product_description_lenght**: number of characters extracted from the product description.
* **seller_id**: seller unique identifier
* **seller_city**: seller city name
* **seller_state**: seller state
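
Because each row describes one item of an order, order-level totals have to be rebuilt by aggregating over `order_id`. The sketch below illustrates this for `price` and `freight_value` on a tiny made-up frame; the `items` and `order_totals` names and all values are hypothetical.

```python
import pandas as pd

# Invented item-level rows: order "A" has two items, order "B" has one.
items = pd.DataFrame({
    "order_id": ["A", "A", "B"],
    "order_item_id": [1, 2, 1],
    "price": [50.0, 30.0, 20.0],
    "freight_value": [7.5, 7.5, 9.0],
})

# Sum the per-item price and freight share to get per-order totals.
order_totals = items.groupby("order_id")[["price", "freight_value"]].sum()

# order_item_id is sequential within an order, so its maximum is the item count.
order_totals["n_items"] = items.groupby("order_id")["order_item_id"].max()

print(order_totals)
```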

# Rename the Columns

In [None]:
df.rename(columns={'order_purchase_timestamp': 'purchase_date', 
                    'order_approved_at': 'approved_date', 
                    'order_delivered_carrier_date': 'handled_by_logistic_date',
                    'order_delivered_customer_date': 'delivery_date',
                    'order_estimated_delivery_date': 'estimated_delivery_date',
                    'order_item_id': 'item_per_order'
                    }, inplace=True)
df.head()
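
By default, `rename` silently ignores keys that do not match any column, so a typo in the mapping goes unnoticed. The sketch below shows one possible safeguard, passing `errors="raise"`; the `demo` frame and the misspelled key are made up for illustration.

```python
import pandas as pd

# Tiny stand-in frame with two of the original column names.
demo = pd.DataFrame({
    "order_item_id": [1, 2],
    "order_approved_at": ["2018-01-01 10:00:00", "2018-01-02 11:30:00"],
})

mapping = {"order_item_id": "item_per_order", "order_approved_at": "approved_date"}

# errors="raise" makes rename fail if a key is not an existing column.
renamed = demo.rename(columns=mapping, errors="raise")

try:
    # Deliberate typo in the source column name.
    demo.rename(columns={"order_aproved_at": "approved_date"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```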

# Clean the Data

1. Remove the `order_id`, `customer_unique_id`, `product_id`, `seller_id`, `product_category_name` columns (`order_item_id` was renamed to `item_per_order` above and is kept).

2. Convert the `order_status`, `payment_type`, `product_category_name_english` columns to categorical variables.

3. Remove the time component from the `order_estimated_delivery_date` column (renamed to `estimated_delivery_date` above).

4. Encode the `customer_city`, `customer_state`, `seller_city`, `seller_state` columns.
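
Steps 2 and 3 can be sketched with plain pandas as follows. This is only one possible reading of the steps: it uses the `category` dtype and `dt.normalize()`, which is not what the cells below do (they use `LabelEncoder` and Unix timestamps). The `demo` frame and its values are invented.

```python
import pandas as pd

# Invented rows with two categorical columns and one timestamp column.
demo = pd.DataFrame({
    "order_status": ["delivered", "shipped", "delivered"],
    "payment_type": ["credit_card", "boleto", "credit_card"],
    "estimated_delivery_date": [
        "2018-01-10 00:00:00",
        "2018-02-03 13:45:00",
        "2018-02-20 08:30:00",
    ],
})

# Step 2: store the text labels as pandas categoricals.
for col in ["order_status", "payment_type"]:
    demo[col] = demo[col].astype("category")

# Step 3: parse the timestamps and reset the time of day to midnight.
demo["estimated_delivery_date"] = pd.to_datetime(
    demo["estimated_delivery_date"]
).dt.normalize()

print(demo.dtypes)
print(demo)
```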

In [None]:
df.drop(['order_id', 'customer_unique_id', 'product_id', 'seller_id', 'product_category_name'], axis=1, inplace=True)

In [None]:
# Drop duplicate rows
df.drop_duplicates(inplace=True)
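
Once the identifier columns are gone, rows that differed only in those identifiers become identical, so `drop_duplicates` may remove more than accidental repeats. A minimal sketch for checking how many rows are affected before dropping them, on a made-up `demo` frame:

```python
import pandas as pd

# Invented rows: the first two are identical once identifiers are removed.
demo = pd.DataFrame({
    "price": [10.0, 10.0, 25.0],
    "payment_type": ["boleto", "boleto", "voucher"],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(demo.duplicated().sum())

# Non-inplace variant, so the original frame stays available for comparison.
deduped = demo.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) out of {len(demo)}")
```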

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Order Status
df['order_status'] = le.fit_transform(df['order_status'])

# payment_type
df['payment_type'] = le.fit_transform(df['payment_type'])

# product_category_name_english
df['product_category_name_english'] = le.fit_transform(df['product_category_name_english'])

# seller_city
df['seller_city'] = le.fit_transform(df['seller_city'])

# customer_state
df['customer_state'] = le.fit_transform(df['customer_state'])

# customer_city
df['customer_city'] = le.fit_transform(df['customer_city'])

# seller_state
df['seller_state'] = le.fit_transform(df['seller_state'])
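
Reusing a single `LabelEncoder` means each `fit_transform` call overwrites the previous fit, so only the mapping of the last column can be inverted afterwards. If the original labels are needed later (for example to label plots), one option is to keep one encoder per column. The sketch below assumes a small made-up `demo` frame; the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows with two of the encoded columns.
demo = pd.DataFrame({
    "payment_type": ["credit_card", "boleto", "credit_card"],
    "customer_state": ["SP", "RJ", "SP"],
})

# One fitted encoder per column, kept for later decoding.
encoders = {}
for col in ["payment_type", "customer_state"]:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    encoders[col] = enc

# Map the integer codes back to the original labels.
decoded = encoders["customer_state"].inverse_transform(demo["customer_state"])

print(demo)
print(list(decoded))
```

Note that the integer codes follow the sorted order of the labels and carry no numeric meaning, which matters when reading correlations computed on them.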

In [None]:
# Convert each date column to Unix time (seconds since 1970-01-01).
# 'int64' is used explicitly because plain int can be 32-bit on some platforms.
date_columns = ['purchase_date', 'approved_date', 'handled_by_logistic_date',
                'delivery_date', 'estimated_delivery_date', 'shipping_limit_date']
for col in date_columns:
    df[col] = pd.to_datetime(df[col]).astype('int64') / 10**9
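
Missing dates (`NaT`) have no meaningful integer representation, so an integer cast can fail or produce a sentinel value, depending on the pandas version. A possible alternative, sketched below on made-up values, subtracts the Unix epoch and uses `total_seconds()`, which keeps missing dates as `NaN` and does not depend on the column having nanosecond resolution.

```python
import pandas as pd

# Invented values, including one missing delivery date.
demo = pd.DataFrame({
    "delivery_date": ["2018-01-01 00:00:00", None, "1970-01-02 00:00:00"],
})

parsed = pd.to_datetime(demo["delivery_date"])  # None becomes NaT
epoch = pd.Timestamp("1970-01-01")

# Timedelta since the epoch, expressed in seconds; NaT becomes NaN.
demo["delivery_date"] = (parsed - epoch).dt.total_seconds()

print(demo)
```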