# <span style="color:#0F19C9">Contents</span>

- [Importing and loading data](#importing-and-loading-data)
- [Understanding data](#understanding-data)

# <span style="color:#0F19C9">Importing and loading data</span>

In [1]:
import pandas as pd
import os

import matplotlib.pyplot as plt

In [2]:
# Define my color palette
juan_colors = ['#101B4B', '#545E85', '#A3A8B2',
               '#E7E7E7', '#0F19C9', '#F6D673']

# Setting plot font
plt.rc('font', family='Georgia', size=12)

In [20]:
# Set the path of the raw data folder and list its CSV files
folder = '../Data/Raw/'
files = sorted(file for file in os.listdir(folder) if file.endswith('.csv'))

# Read each CSV file into a dataframe and store it in a dictionary
dataframes = {}
for file in files:
    name = os.path.splitext(file)[0]  # File name without the extension
    route = os.path.join(folder, file)
    dataframes[name] = pd.read_csv(route)

# <span style="color:#0F19C9">Understanding data</span>

In [24]:
# Collect basic info per dataframe: shape, null cells and duplicated rows
names = list(dataframes.keys())
count_rows = [dataframe.shape[0] for dataframe in dataframes.values()]
count_columns = [dataframe.shape[1] for dataframe in dataframes.values()]
count_null = [dataframe.isna().sum().sum()
              for dataframe in dataframes.values()]
count_duplicates = [dataframe.duplicated().sum()
                    for dataframe in dataframes.values()]

# Gather the results in a summary dataframe
info = {'Dataframe_Name': names,
        'Rows': count_rows,
        'Columns': count_columns,
        'Null_Values': count_null,
        'Duplicated_Values': count_duplicates}
basic_info = pd.DataFrame(info)
basic_info

| | Dataframe_Name | Rows | Columns | Null_Values | Duplicated_Values |
|---|---|---|---|---|---|
| 0 | aisles | 134 | 2 | 0 | 0 |
| 1 | departments | 21 | 2 | 0 | 0 |
| 2 | orders_1 | 2000000 | 7 | 120182 | 0 |
| 3 | orders_2 | 1421083 | 7 | 86027 | 0 |
| 4 | orders_products__prior_1 | 2000000 | 4 | 0 | 0 |
| 5 | orders_products__prior_10 | 2000000 | 4 | 0 | 0 |
| 6 | orders_products__prior_11 | 2000000 | 4 | 0 | 0 |
| 7 | orders_products__prior_12 | 2000000 | 4 | 0 | 0 |
| 8 | orders_products__prior_13 | 2000000 | 4 | 0 | 0 |
| 9 | orders_products__prior_14 | 2000000 | 4 | 0 | 0 |


In [26]:
# Find the column with the most null values
dataframes['orders_1'].isna().sum().sort_values(ascending=False).index[0]

'days_since_prior_order'

In [27]:
# Find the column with the most null values
dataframes['orders_2'].isna().sum().sort_values(ascending=False).index[0]

'days_since_prior_order'

In [30]:
# Convert the column names to title case and print them
for dataframe in dataframes.values():
    dataframe.columns = [old_col.title() for old_col in dataframe.columns]
    print(f'Column names: {dataframe.columns.to_list()}')

Column names: ['Aisle_Id', 'Aisle']
Column names: ['Department_Id', 'Department']
Column names: ['Order_Id', 'User_Id', 'Eval_Set', 'Order_Number', 'Order_Dow', 'Order_Hour_Of_Day', 'Days_Since_Prior_Order']
Column names: ['Order_Id', 'User_Id', 'Eval_Set', 'Order_Number', 'Order_Dow', 'Order_Hour_Of_Day', 'Days_Since_Prior_Order']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reordered']
Column names: ['Order_Id', 'Product_Id', 'Add_To_Cart_Order', 'Reo

In [29]:
# Count the original dataframes
print(f'We start with {len(dataframes)} dataframes')

We start with 24 dataframes


We can now work with a dictionary that maps each file name (without its extension) to the corresponding dataframe.
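
As a rough illustration of this name-to-dataframe mapping, the sketch below builds the same kind of dictionary from a temporary folder. The helper `load_csv_folder`, the variable `demo_frames`, and the tiny file contents are hypothetical and exist only for the example; the real notebook reads from `../Data/Raw/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Return {file name without extension: DataFrame} for each .csv in folder."""
    return {path.stem: pd.read_csv(path)
            for path in sorted(Path(folder).glob('*.csv'))}


# Hypothetical miniature of the raw folder (made-up contents)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'aisles.csv').write_text('aisle_id,aisle\n1,example aisle\n')
    Path(tmp, 'departments.csv').write_text('department_id,department\n1,example\n')
    Path(tmp, 'notes.txt').write_text('not a csv file\n')  # Skipped by the glob
    demo_frames = load_csv_folder(tmp)

print(list(demo_frames))
```

Keying by the file stem lets a single loop, or a dictionary comprehension, apply the same check to every table without keeping 24 separate variables.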

Only two dataframes have null values, `orders_1` and `orders_2`, and in both the nulls are in the `days_since_prior_order` column. No dataframe contains duplicated rows.
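
One possible way to confirm where the nulls sit is to list, for each dataframe, every column that contains at least one null, rather than looking only at the column with the highest count. The function `null_columns` and the toy dataframes below are assumptions made for this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_columns(frames):
    """Map each dataframe name to {column: null count} for columns with nulls."""
    report = {}
    for name, frame in frames.items():
        counts = frame.isna().sum()
        counts = counts[counts > 0]  # Keep only columns that have nulls
        if not counts.empty:
            report[name] = {col: int(n) for col, n in counts.items()}
    return report


# Toy stand-ins for the real dataframes (values are made up)
toy_frames = {
    'aisles': pd.DataFrame({'aisle_id': [1, 2], 'aisle': ['a', 'b']}),
    'orders_1': pd.DataFrame({'order_id': [1, 2, 3],
                              'days_since_prior_order': [np.nan, 7.0, np.nan]}),
}
print(null_columns(toy_frames))
```

Run against the real `dataframes` dictionary, a report like this would show directly whether `days_since_prior_order` is the only column involved.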

Finally, we standardized the column names in all 24 dataframes by capitalizing the first letter of each word.
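
To make the renaming rule concrete, here is a small sketch of how `str.title` treats underscore-separated names. The empty dataframe `toy` is hypothetical, and `DataFrame.rename` is used here as a non-mutating alternative to assigning to `.columns` as the notebook does.

```python
import pandas as pd

# Hypothetical empty dataframe that reuses three of the original column names
toy = pd.DataFrame(columns=['order_id', 'order_hour_of_day',
                            'days_since_prior_order'])

# str.title capitalizes the first letter after every non-letter character,
# so each word between underscores starts with a capital
renamed = toy.rename(columns=str.title)
print(renamed.columns.to_list())
# → ['Order_Id', 'Order_Hour_Of_Day', 'Days_Since_Prior_Order']
```

Because `rename` returns a new dataframe, `toy` keeps its original lowercase names, which can be convenient when both versions are needed.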