## Part 1: Tidying and Reshaping Data

### Part 1.1

In [1]:
import numpy as np
import pandas as pd

# Read data from CSV into a DataFrame
data = pd.read_csv("https://raw.githubusercontent.com/ktxdev/AIM-5001/main/M11/1.%20Data/M11_Data.csv")

# Show first few rows of the DataFrame
data.head()

Unnamed: 0,Month,Category,Caltex,Gulf,Mobil
0,Open,Engine Oil,140 : 000,199 : 000,141 : 000
1,,GearBox Oil,198 : 000,132 : 000,121 : 000
2,Jan,Engine Oil,170 : 103,194 : 132,109 : 127
3,,GearBox Oil,132 : 106,125 : 105,191 : 100
4,Feb,Engine Oil,112 : 133,138 : 113,171 : 101


This code uses the `pd.melt()` function to reshape the DataFrame from wide to long format. The 'Month' and 'Category' columns serve as identifier variables, so they are kept as columns. A new 'Suppliers' column holds the names of the remaining columns ('Caltex', 'Gulf', and 'Mobil'), and a new 'Purchased:Consumed' column holds the corresponding values. `data.head()` then displays the first few rows of the melted result.

In [2]:
# Convert Dataframe from 'wide' to 'long' format
data = pd.melt(data, id_vars=['Month', 'Category'], var_name="Suppliers", value_name="Purchased:Consumed")
# Show first few rows of the new DataFrame
data.head()

Unnamed: 0,Month,Category,Suppliers,Purchased:Consumed
0,Open,Engine Oil,Caltex,140 : 000
1,,GearBox Oil,Caltex,198 : 000
2,Jan,Engine Oil,Caltex,170 : 103
3,,GearBox Oil,Caltex,132 : 106
4,Feb,Engine Oil,Caltex,112 : 133
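
To make the reshaping easier to follow, here is a minimal sketch of the same `pd.melt()` call on a small hypothetical table. The names `wide` and `long` are illustrative, and the two rows are copied from the preview above rather than loaded from the CSV.

```python
import pandas as pd

# Hypothetical two-row table in the same wide layout as the source data
wide = pd.DataFrame({
    "Month": ["Jan", "Feb"],
    "Category": ["Engine Oil", "Engine Oil"],
    "Caltex": ["170 : 103", "112 : 133"],
    "Gulf": ["194 : 132", "138 : 113"],
})

# Every column not listed in id_vars is unpivoted: one output row per original cell
long = pd.melt(wide, id_vars=["Month", "Category"],
               var_name="Suppliers", value_name="Purchased:Consumed")

print(long)
```

Two input rows with two supplier columns give four output rows, stacked supplier by supplier.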


This code splits each string in the 'Purchased:Consumed' column on ':' using the `str.split` method with `expand=True` and assigns the two resulting pieces to new 'Purchased' and 'Consumed' columns. It then drops the 'Purchased:Consumed' column and assigns the result back to `data`. Finally, `data.head()` displays the first few rows of the modified DataFrame.

In [3]:
# Split values in 'Purchased:Consumed' column into 2 different columns
data[['Purchased', 'Consumed']] = data['Purchased:Consumed'].str.split(":", expand=True)
# Drop the 'Purchased:Consumed' and store the result in data
data = data.drop(columns="Purchased:Consumed")
# Show first few rows of the new DataFrame
data.head()

Unnamed: 0,Month,Category,Suppliers,Purchased,Consumed
0,Open,Engine Oil,Caltex,140,0
1,,GearBox Oil,Caltex,198,0
2,Jan,Engine Oil,Caltex,170,103
3,,GearBox Oil,Caltex,132,106
4,Feb,Engine Oil,Caltex,112,133
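
Because the raw values look like `140 : 000`, splitting on `":"` alone leaves a space attached to each piece. The sketch below, on a hypothetical two-value Series, shows this and one possible way to strip the pieces. The names `raw`, `parts`, and `clean` are illustrative only.

```python
import pandas as pd

# Two sample values in the same "purchased : consumed" format
raw = pd.Series(["140 : 000", "170 : 103"], name="Purchased:Consumed")

# Splitting on ":" keeps the surrounding spaces in each piece
parts = raw.str.split(":", expand=True)
print(parts[0].tolist())  # ['140 ', '170 ']

# Strip whitespace from every piece, then name the columns
clean = parts.apply(lambda col: col.str.strip())
clean.columns = ["Purchased", "Consumed"]
print(clean)
```

Converting to `int` tolerates the leftover spaces, so stripping mainly matters if the values are compared or displayed as strings first.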


This code uses the `ffill()` method to fill in the missing values in the 'Month' column. The assumption is that the last non-missing value above a row is the correct month for that row. The filled Series then replaces the 'Month' column, and `data.head()` shows the result for the first few rows.

In [4]:
# Handle missing values for the 'Month' column
data['Month'] = data['Month'].ffill()
# Show first few rows of the new DataFrame
data.head()

Unnamed: 0,Month,Category,Suppliers,Purchased,Consumed
0,Open,Engine Oil,Caltex,140,0
1,Open,GearBox Oil,Caltex,198,0
2,Jan,Engine Oil,Caltex,170,103
3,Jan,GearBox Oil,Caltex,132,106
4,Feb,Engine Oil,Caltex,112,133
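
As a small illustration of forward filling, the sketch below applies `ffill()` to a hypothetical Series laid out like the 'Month' column, where only the first row of each month is labelled. The names `months` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Only the first row of each month carries a label, as in the source data
months = pd.Series(["Open", np.nan, "Jan", np.nan, "Feb"])

# Each missing entry takes the most recent non-missing value above it
filled = months.ffill()
print(filled.tolist())  # ['Open', 'Open', 'Jan', 'Jan', 'Feb']
```

This works after melting because the rows are stacked supplier by supplier and, judging from the preview above, each block starts with a labelled 'Open' row, so a value is never carried across a block boundary by mistake.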


### Part 1.2

This code begins by casting the 'Purchased' and 'Consumed' columns to `int` using the `astype` method, ensuring that both columns contain numerical data. It then groups the DataFrame by the 'Month' and 'Category' columns and applies a lambda function to each group, subtracting the sum of the 'Consumed' values from the sum of the 'Purchased' values to determine the remaining oil. The result is stored in the variable `data1`, a Series whose index holds the unique combinations of 'Month' and 'Category' and whose values are the calculated differences.

In [5]:
# Cast 'Purchased' column values to int
data['Purchased'] = data['Purchased'].astype(int)
# Cast 'Consumed' column values to int
data['Consumed'] = data['Consumed'].astype(int)
# Group by month and category, then compute purchased minus consumed per group
data1 = data.groupby(['Month', 'Category']).apply(lambda x: x['Purchased'].sum() - x['Consumed'].sum())
# Display the results of the grouping
data1

Month  Category   
Apr    Engine Oil      -3
       GearBox Oil    116
Feb    Engine Oil      74
       GearBox Oil    132
Jan    Engine Oil     111
       GearBox Oil    137
Jun    Engine Oil     126
       GearBox Oil     61
Mar    Engine Oil      90
       GearBox Oil    134
May    Engine Oil      99
       GearBox Oil     62
Open   Engine Oil     480
       GearBox Oil    451
dtype: int64
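
The same per-group difference can also be computed without `apply`, by summing both columns per group and subtracting the results. The sketch below uses a small hypothetical DataFrame; `df`, `totals`, and `remaining` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical long-format rows with already-numeric columns
df = pd.DataFrame({
    "Month": ["Jan", "Jan", "Jan", "Feb"],
    "Category": ["Engine Oil", "Engine Oil", "GearBox Oil", "Engine Oil"],
    "Purchased": [170, 194, 132, 112],
    "Consumed": [103, 132, 106, 133],
})

# Sum both columns per (Month, Category) group in one pass
totals = df.groupby(["Month", "Category"])[["Purchased", "Consumed"]].sum()

# Remaining oil = total purchased minus total consumed, aligned on the group index
remaining = totals["Purchased"] - totals["Consumed"]
print(remaining)
```

This avoids calling a Python function once per group and keeps the intermediate totals available for inspection.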

The code groups the data by 'Category' and sums the 'Consumed' values for each category to answer the question: 'What was the most consumed brand of oil across the two separate categories/types of oil?'. According to the results, GearBox Oil was the more consumed of the two categories, with a total of 2200 compared with 2191 for Engine Oil.

In [6]:
# Group by 'Category' and calculate the sum consumed for every category
data.groupby('Category')['Consumed'].sum()

Category
Engine Oil     2191
GearBox Oil    2200
Name: Consumed, dtype: int64
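
If the question is instead read as asking which supplier (brand) was consumed most within each category, one possible approach is to total consumption per category and supplier and then pick the largest entry in each row. The sketch below uses hypothetical numbers, so its result says nothing about the real data; `df`, `wide`, and `top` are illustrative names.

```python
import pandas as pd

# Hypothetical consumption figures, not the full data set
df = pd.DataFrame({
    "Category": ["Engine Oil", "Engine Oil", "GearBox Oil", "GearBox Oil"],
    "Suppliers": ["Caltex", "Gulf", "Caltex", "Gulf"],
    "Consumed": [103, 132, 106, 105],
})

# One row per category, one column per supplier, values are total consumption
wide = df.pivot_table(index="Category", columns="Suppliers",
                      values="Consumed", aggfunc="sum")

# For each category, the column label holding the largest total
top = wide.idxmax(axis=1)
print(top)
```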

### Part 1.3

To display the data in wide format, I would use the three categorical variables 'Month', 'Category', and 'Suppliers' as the index, and the columns 'Consumed' and 'Purchased' as values. This representation is easier to read because the actual values sit in the two columns 'Consumed' and 'Purchased', making it simple to look up the values for a given supplier and category in a particular month. To do this, the code below calls the `data.pivot_table` method with 'Month', 'Category', and 'Suppliers' as the index and 'Consumed' and 'Purchased' as the value columns, aggregated with the sum function.

In [7]:
# Create pivot table to display the data in a 'wide' format
data.pivot_table(index=['Month', 'Category', 'Suppliers'], values=['Consumed','Purchased'], aggfunc = 'sum')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Consumed,Purchased
Month,Category,Suppliers,Unnamed: 3_level_1,Unnamed: 4_level_1
Apr,Engine Oil,Caltex,150,149
Apr,Engine Oil,Gulf,118,117
Apr,Engine Oil,Mobil,118,117
Apr,GearBox Oil,Caltex,125,185
Apr,GearBox Oil,Gulf,133,191
Apr,GearBox Oil,Mobil,121,119
Feb,Engine Oil,Caltex,133,112
Feb,Engine Oil,Gulf,113,138
Feb,Engine Oil,Mobil,101,171
Feb,GearBox Oil,Caltex,148,193
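
The pivot table above lists months alphabetically ('Apr', 'Feb', ...) because 'Month' is a plain string column. One way to get chronological order, sketched below on hypothetical data, is to convert the column to an ordered categorical before pivoting. The list `order` is an assumption about the intended sequence, with 'Open' placed first.

```python
import pandas as pd

# Hypothetical subset of the long-format data
df = pd.DataFrame({
    "Month": ["Open", "Jan", "Feb", "Apr"],
    "Purchased": [140, 170, 112, 149],
})

# Assumed chronological order of the labels
order = ["Open", "Jan", "Feb", "Mar", "Apr", "May", "Jun"]
df["Month"] = pd.Categorical(df["Month"], categories=order, ordered=True)

# observed=True keeps only the months that actually occur in the data
table = df.pivot_table(index="Month", values="Purchased",
                       aggfunc="sum", observed=True)
print(table.index.tolist())  # ['Open', 'Jan', 'Feb', 'Apr']
```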


## Part 2: Using Your GroupBy and Data Aggregation Skills

### Load Data

In [8]:
import numpy as np
import pandas as pd

# load the data set
auto_df = pd.read_csv("https://raw.githubusercontent.com/ktxdev/AIM-5001/main/M11/1.%20Data/auto-mpg.data", sep = r"\s+", header = None)

# add meaningful column names
auto_df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model', 'origin', 'car_name']

# replace '?' in horsepower column with 'NaN' (assign back instead of a chained inplace call)
auto_df['horsepower'] = auto_df['horsepower'].replace('?', np.nan)

# convert the column to numeric
auto_df["horsepower"] = pd.to_numeric(auto_df["horsepower"])

# replace origin values using a dict (assign back instead of a chained inplace call)
auto_df['origin'] = auto_df['origin'].replace({1: 'USA', 2: 'Asia', 3: 'Europe'})
auto_df.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,USA,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,USA,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,USA,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,USA,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,USA,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,USA,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,USA,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,USA,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,USA,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,USA,amc ambassador dpl
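
The two cleaning steps for 'horsepower' (replacing '?' and converting to numeric) can also be combined. The sketch below, on a hypothetical Series, uses the `errors="coerce"` option of `pd.to_numeric`, which turns any unparseable entry into `NaN`. The names `hp` and `hp_num` are illustrative.

```python
import pandas as pd

# Hypothetical raw values, with '?' marking a missing measurement
hp = pd.Series(["130.0", "?", "150.0"])

# Entries that cannot be parsed as numbers become NaN
hp_num = pd.to_numeric(hp, errors="coerce")
print(hp_num)
```

Note that this coerces every non-numeric string, not only '?', so the explicit replacement used above is the stricter choice when unexpected values should raise an error.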


### Part 2.1

The code uses the `groupby` method to group rows by each distinct combination of values in the 'origin' and 'cylinders' columns. It then applies the `size` aggregation to the 'cylinders' column to count the rows in each group. The result is a DataFrame with a hierarchical column index, where the outer level is 'cylinders' and the inner level is 'size'. To produce the desired result, the code calls `rename` with a `dict` that renames the 'size' column to 'Quantity' and the 'cylinders' label to an empty string, which blanks out the outer level of the hierarchical column index in the displayed table.

In [9]:
# Group values in dataframe
result_df = auto_df.groupby(['origin', 'cylinders']).agg({'cylinders': ['size']})
# Rename the resulting data frame
result_df.rename(columns={'size': 'Quantity', 'cylinders': ''})

Unnamed: 0_level_0,Unnamed: 1_level_0,Quantity
origin,cylinders,Unnamed: 2_level_1
Asia,4,63
Asia,5,3
Asia,6,4
Europe,3,4
Europe,4,69
Europe,6,6
USA,4,72
USA,6,74
USA,8,103
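
A possible alternative that avoids the hierarchical column index altogether is to call `size()` on the grouped object and name the resulting column directly. This is a sketch on a hypothetical four-row table; `cars` and `counts` are illustrative names.

```python
import pandas as pd

# Hypothetical subset of the auto data
cars = pd.DataFrame({
    "origin": ["USA", "USA", "USA", "Asia"],
    "cylinders": [8, 8, 4, 4],
})

# size() counts rows per group; to_frame() names the single resulting column
counts = cars.groupby(["origin", "cylinders"]).size().to_frame("Quantity")
print(counts)
```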


### Part 2.2

The code uses the `groupby` method to group rows by each unique combination of values in the 'origin' and 'model' columns. It then applies the `mean` aggregation to both the 'mpg' and 'weight' columns, which yields the desired result DataFrame.

In [10]:
# Group and perform aggregation on 'mpg' and 'weight'
auto_df.groupby(['origin', 'model']).agg({'mpg': 'mean', 'weight': "mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,weight
origin,model,Unnamed: 2_level_1,Unnamed: 3_level_1
Asia,70,25.2,2309.2
Asia,71,28.75,2024.0
Asia,72,22.0,2573.2
Asia,73,24.0,2335.714286
Asia,74,27.0,2139.333333
Asia,75,24.5,2571.166667
Asia,76,24.25,2611.0
Asia,77,29.25,2138.75
Asia,78,24.95,2691.666667
Asia,79,30.45,2693.75
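
When the output columns need more descriptive names, the same aggregation can be written with named aggregation, where each keyword is the new column name and each value is a `(column, function)` pair. The sketch below uses hypothetical rows; `cars`, `result`, `avg_mpg`, and `avg_weight` are illustrative names.

```python
import pandas as pd

# Hypothetical subset of the auto data
cars = pd.DataFrame({
    "origin": ["USA", "USA", "Asia"],
    "model": [70, 70, 70],
    "mpg": [18.0, 15.0, 24.0],
    "weight": [3504.0, 3693.0, 2372.0],
})

# keyword = output column name, value = (input column, aggregation)
result = cars.groupby(["origin", "model"]).agg(
    avg_mpg=("mpg", "mean"),
    avg_weight=("weight", "mean"),
)
print(result)
```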


### Part 2.3

The code begins by defining the bin edges and labels for binning the 'model' column. It then uses the `pd.cut` function to assign each value in the 'model' column to a bin and overwrites the 'model' column with the corresponding bin labels. Next, the code groups the data by the binned 'model' column and applies five aggregation functions to the 'weight' column in each group: mean, size, median, minimum, and maximum. The result of this aggregation is saved in the `result_df` variable. The columns of `result_df` are then renamed using a dictionary that maps the old names to the names required in the final table. Finally, the `stack()` method moves the column level holding the aggregate names into the innermost level of the resulting row MultiIndex, as required.

In [11]:
# Define bins
bins = [70, 72, 75, 77, 80, 82]
# Define bin labels
bin_labels = ["(70.0, 72.0]", "(72.0, 75.0]", "(75.0, 77.0]", "(77.0, 80.0]", "(80.0, 82.0]"]
# Perform binning on the 'model' column
auto_df['model'] = pd.cut(auto_df['model'], bins=bins, labels=bin_labels)
# Group and perform aggregations on 'weight'
result_df = auto_df.groupby("model").agg({'weight': ['mean', 'size', 'max', 'median', 'min']})
# Rename columns and stack data
result_df.rename(columns={'mean': 'Average Weight', 'size': 'Count', 'max': 'Max Weight', 'median': 'Median Weight', 'min': 'Min Weight'}).stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,weight
model,Unnamed: 1_level_1,Unnamed: 2_level_1
"(70.0, 72.0]",Average Weight,3116.571429
"(70.0, 72.0]",Count,56.0
"(70.0, 72.0]",Max Weight,5140.0
"(70.0, 72.0]",Median Weight,2947.5
"(70.0, 72.0]",Min Weight,1613.0
"(72.0, 75.0]",Average Weight,3193.494845
"(72.0, 75.0]",Count,97.0
"(72.0, 75.0]",Max Weight,4997.0
"(72.0, 75.0]",Median Weight,3021.0
"(72.0, 75.0]",Min Weight,1649.0

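
One detail of `pd.cut` worth keeping in mind: by default the intervals are closed on the right and open on the left, so a value equal to the lowest edge is not assigned to any bin. The sketch below shows this on a hypothetical Series of model years and how `include_lowest=True` changes it. The names `years`, `edges`, `binned`, and `binned_incl` are illustrative.

```python
import pandas as pd

# Hypothetical model years, including the lowest edge value 70
years = pd.Series([70, 71, 72, 73, 82])
edges = [70, 72, 75, 77, 80, 82]

# Default: right-closed intervals, so 70 falls outside (70, 72]
binned = pd.cut(years, bins=edges)
print(binned.isna().tolist())  # [True, False, False, False, False]

# include_lowest=True widens the first interval to take in the lowest edge
binned_incl = pd.cut(years, bins=edges, include_lowest=True)
print(binned_incl.isna().sum())  # 0
```

Whether model year 70 should be counted depends on the table being reproduced; with the default settings used above it is left out of the first bin.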