### Car sales program:

##### Carry out the following tasks in JupyterLab:
##### Because the market contains many car models, whose names are complicated and not comparable with one another, “category” is a more useful feature than “model” for our later analysis.
##### Remove “model” from the DataFrame.

In [2]:
import pandas as pd
# You can also use the full path C:\Users\blah\blah\blah\blah\car_model_price.csv
# However, note that Windows paths use "\" (backslash), which Python treats as an escape character in ordinary strings
# If you don't want to worry about this, just use a raw string (r"C:\Users\blah\blah\blah\blah\car_model_price.csv")

car_model_price = pd.read_csv("../Lecture_4/car_model_price.csv")
car_model_price

Unnamed: 0,Year,Make,Model,Category,Price
0,2020,Audi,Q3,SUV,
1,2020,Chevrolet,Malibu,Sedan,
2,2020,Cadillac,Escalade ESV,SUV,
3,2020,Chevrolet,Corvette,"Coupe, Convertible",
4,2020,Acura,RLX,Sedan,
...,...,...,...,...,...
9831,2022,Mitsubishi,Eclipse Cross,SUV,
9832,2022,Nissan,Frontier Crew Cab,Pickup,
9833,2022,Nissan,Pathfinder,SUV,
9834,2022,Subaru,BRZ,Coupe,
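
The path comment in the cell above can be illustrated with a small sketch. The path used here is hypothetical and no file is opened; the point is only that a raw string, an escaped string and a forward-slash string all describe the same Windows location.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows path, written three equivalent ways
raw = r"C:\Users\example\data\cars.csv"          # raw string: backslashes taken literally
escaped = "C:\\Users\\example\\data\\cars.csv"   # ordinary string: each backslash doubled
forward = "C:/Users/example/data/cars.csv"       # forward slashes are also accepted on Windows

# The raw and escaped spellings are the same Python string
same_string = raw == escaped

# PureWindowsPath normalises the separators, so all spellings compare equal
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)
file_name = PureWindowsPath(forward).name

print(same_string, same_path, file_name)
```

A relative path such as the one passed to `read_csv` above avoids the issue altogether, as long as the notebook is run from the expected folder.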


In [3]:
car_dropmodel_price = car_model_price.drop("Model", axis = 1)
car_dropmodel_price

Unnamed: 0,Year,Make,Category,Price
0,2020,Audi,SUV,
1,2020,Chevrolet,Sedan,
2,2020,Cadillac,SUV,
3,2020,Chevrolet,"Coupe, Convertible",
4,2020,Acura,Sedan,
...,...,...,...,...
9831,2022,Mitsubishi,SUV,
9832,2022,Nissan,Pickup,
9833,2022,Nissan,SUV,
9834,2022,Subaru,Coupe,
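
As a small aside on the `drop` call above, here is a sketch on a toy frame (the values are made up for illustration). It shows that `drop("Model", axis = 1)` and `drop(columns=["Model"])` give the same result, and that `drop` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Toy frame with made-up values, mirroring the column names used above
toy = pd.DataFrame({"Make": ["Audi", "Subaru"],
                    "Model": ["Q3", "BRZ"],
                    "Category": ["SUV", "Coupe"]})

by_axis = toy.drop("Model", axis=1)      # the form used in the cell above
by_name = toy.drop(columns=["Model"])    # equivalent, arguably easier to read

print(by_axis.equals(by_name))
print(list(toy.columns))                 # the original still contains "Model"
```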


##### Reduce the categories in “category” by assigning each car to a SOLE category if it has more than one in the original dataset such as “Convertible, Coupe” or “Coupe, Convertible”.

We should first determine how many distinct values there are and what they are. To do this, we can use the "get_dummies" approach.
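
To see what this approach does before running it on the real column, here is a minimal sketch on a toy Series (the values are invented). It assumes, as the real data suggests, that some entries use ", " and others use "," without a space.

```python
import pandas as pd

# Toy column with invented values, mixing ", " and "," as separators
toy = pd.Series(["SUV", "Coupe, Convertible", "Coupe,Convertible", "Sedan"])

# Splitting on ", " breaks up "Coupe, Convertible" but not "Coupe,Convertible"
counts_by_label = toy.str.get_dummies(sep=", ").sum()

# value_counts() counts the exact strings instead, without any splitting
counts_exact = toy.value_counts()

print(counts_by_label)
print(counts_exact)
```

The choice of separator therefore decides which combined labels survive in the summary.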

In [4]:
car_dropmodel_price["Category"].str.get_dummies(sep=', ').sum()

Convertible                 964
Convertible,Coupe             1
Convertible,Sedan,Coupe       1
Coupe                      1253
Coupe,Convertible             3
Hatchback                   754
Hatchback,Sedan               3
Pickup                     1645
SUV                        2278
SUV1992                       1
SUV2020                       1
Sedan                      2817
Van/Minivan                1153
Wagon                       620
Wagon,Sedan                   1
dtype: int64

There are a lot of categories, and the entries with two or more categories contain a comma. So it makes sense to split on the comma.

In [6]:
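# Split each entry on "," into a list of labels
split_data = car_dropmodel_price["Category"].str.split(",")
data = split_data.to_list()
# Build one column per label position and keep only the first one, 'a'
names = ["a", "b", "c","d"]
new_df = pd.DataFrame(data, columns=names).drop(['b','c','d'], axis = 1)
new_df.head()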

Unnamed: 0,a
0,SUV
1,Sedan
2,SUV
3,Coupe
4,Sedan


So my new temporary column 'a' now holds a single category per row (look at index 3, the fourth row, for instance).
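
A more compact alternative, sketched here on invented values and not the method used in this notebook, is to take the first element of each split directly. It does not need to know in advance how many labels the longest entry has.

```python
import pandas as pd

# Toy column with invented values
toy = pd.Series(["SUV", "Coupe, Convertible", "Convertible,Sedan,Coupe"])

# Split on ",", keep the first label, and strip any stray whitespace
first_label = toy.str.split(",").str[0].str.strip()

print(first_label.tolist())
```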

In [7]:
car_dropmodel_price= pd.concat([car_dropmodel_price, new_df], axis = 1)
car_dropmodel_price = car_dropmodel_price.drop(['Category'], axis = 1)
car_dropmodel_price.rename(columns = {'a':'Category'}, inplace = True)
car_dropmodel_price

Unnamed: 0,Year,Make,Price,Category
0,2020,Audi,,SUV
1,2020,Chevrolet,,Sedan
2,2020,Cadillac,,SUV
3,2020,Chevrolet,,Coupe
4,2020,Acura,,Sedan
...,...,...,...,...
9831,2022,Mitsubishi,,SUV
9832,2022,Nissan,,Pickup
9833,2022,Nissan,,SUV
9834,2022,Subaru,,Coupe


Upon checking, most are correct, except for "Van/Minivan", which is NOT separated by a comma. That has to be replaced as well.

And what about "SUV1992" & "SUV2020"? We can categorise them both as "SUV"

In [9]:
car_dropmodel_price["Category"].str.get_dummies(sep=', ').sum()

Convertible     459
Coupe           902
Hatchback       594
Pickup         1642
SUV            2274
SUV1992           1
SUV2020           1
Sedan          2510
Van/Minivan    1153
Wagon           300
dtype: int64

In [10]:
car_dropmodel_price['Category'] = car_dropmodel_price['Category'].str.replace('Van/Minivan', 'Van')
car_dropmodel_price['Category'] = car_dropmodel_price['Category'].str.replace('SUV1992', 'SUV')
car_dropmodel_price['Category'] = car_dropmodel_price['Category'].str.replace('SUV2020', 'SUV')
car_dropmodel_price["Category"].str.get_dummies(sep=', ').sum()

Convertible     459
Coupe           902
Hatchback       594
Pickup         1642
SUV            2276
Sedan          2510
Van            1153
Wagon           300
dtype: int64
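
The three replacements above could also be written as a single mapping. This is only a sketch on a toy Series. Note one difference: `Series.replace` with a dictionary matches whole cell values, whereas `str.replace` substitutes substrings.

```python
import pandas as pd

# Toy column containing the labels that needed fixing
toy = pd.Series(["Van/Minivan", "SUV1992", "SUV2020", "SUV", "Sedan"])

# Old label -> new label; only exact whole-value matches are replaced
fixes = {"Van/Minivan": "Van", "SUV1992": "SUV", "SUV2020": "SUV"}
cleaned = toy.replace(fixes)

print(cleaned.value_counts())
```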

At this stage, the dataset is clean and ready for the next stage. This is a good place to create a checkpoint file.

In [11]:
car_dropmodel_price.to_csv("car_dropmodel_price.csv", index = False)
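
To show why `index = False` matters, here is a small round-trip sketch. It uses an in-memory buffer and made-up values instead of a real file. When the index is written out, it comes back as an extra "Unnamed: 0" column, like the one visible in the data loaded at the top of this notebook.

```python
import io
import pandas as pd

# Toy frame (made-up values) standing in for the cleaned data
toy = pd.DataFrame({"Year": [2020, 2022], "Make": ["Audi", "Subaru"]})

# Write without the index, then read back from the in-memory buffer
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

# Write with the default index=True for comparison
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```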

##### Create a training and a testing dataset. The proportion here should be 0.7:0.3 and the random state should be set to 0.

In [14]:
# Sklearn Libraries
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

##### Now we will create a train/test split. Strictly speaking, machine learning practice often uses three sets, called the "train", "test" and "validation" sets.

1) Training Dataset: The sample of data used to fit the model.

2) Test Dataset (also known as an "out-of-sample" test set): The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. A test dataset is often newly collected data you want to put through your machine learning model.

3) Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

(I won't cover how to get a third set here because I don't want to deviate from your lessons, but note that such things do exist.)
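
For the curious, one common way to obtain three sets is simply to call `train_test_split` twice. The sketch below uses synthetic numbers. The 70/15/15 proportions and the variable names are assumptions for illustration, not part of this exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each, and one target per sample
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 30% of the samples
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 2: split the held-out part in half
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Each sample ends up in exactly one of the three sets.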

##### There are 2 stages in creating a train-test split. The first stage is known as feature selection.

In [12]:
# Feature Selection on data frame

#Create a copy
car_train_test_set=car_dropmodel_price.copy()

Feature = car_train_test_set[[
    'Year', 
    'Make', 
    'Category',
]]
x=Feature

y = car_train_test_set['Price'].values

print(x.head())
print(y[0:5])
print(x.shape, y.shape)

   Year       Make Category
0  2020       Audi      SUV
1  2020  Chevrolet    Sedan
2  2020   Cadillac      SUV
3  2020  Chevrolet    Coupe
4  2020      Acura    Sedan
[nan nan nan nan nan]
(9836, 3) (9836,)


##### Now we do the train-test split proper. A random state of "0" is simply a seed for the random number generator, so that your train-test splits are reproducible. If you don't set a seed, the split is different each time.

In [15]:
random_state = 0
test_size = 0.3

x_train, x_test, y_train, y_test  = train_test_split(
            x, y, test_size = test_size, random_state = random_state
)

print('Train Set: ', x_train.shape, y_train.shape)
print(x_train['Make'][0:5])
print("\n")
print('Test Set: ', x_test.shape, y_test.shape)
print(x_test['Make'][0:5])

Train Set:  (6885, 3) (6885,)
6017           GMC
7663        Subaru
828          Honda
6465        Suzuki
2310    Volkswagen
Name: Make, dtype: object


Test Set:  (2951, 3) (2951,)
4869     Chevrolet
9502       Mercury
9835    Volkswagen
6533     Chevrolet
9784        Toyota
Name: Make, dtype: object
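
The effect of the seed can be checked on toy data. In this sketch (synthetic numbers, not the car data), two calls with the same `random_state` produce identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: the integers 0..19
data = np.arange(20)

# Two independent calls with the same seed
a_train, a_test = train_test_split(data, test_size=0.3, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=0)

# Same seed, same shuffle, so both calls select the same rows
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```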


##### Now I am going to teach you something new, called "pickling", which is a way to convert an object in memory to a byte stream that can be stored on disk or sent over a network. It is useful within Python, but the format is Python-specific and should not be used to exchange data with other programming languages.

In [None]:
import pickle
# Pickle all sets and models

# Train and Test Set
with open('x_train', 'wb') as file:
    pickle.dump(x_train, file)
with open('x_test', 'wb') as file:
    pickle.dump(x_test, file)
with open('y_train', 'wb') as file:
    pickle.dump(y_train, file)
with open('y_test', 'wb') as file:
    pickle.dump(y_test, file)

For completeness, this is how you "unpickle", or load back, the data you need.

In [None]:
# Unpickle all the training and testing data sets
import pickle

# Using "with" closes each file as soon as it has been read
with open("x_train", "rb") as file:
    x_train = pickle.load(file)
print('x Training Set: ', x_train[0:5])
print('Shape x Training Set: ', x_train.shape)
print('\n')

with open("x_test", "rb") as file:
    x_test = pickle.load(file)
print('x Testing Set: ', x_test[0:5])
print('Shape x Testing Set: ', x_test.shape)
print('\n')

with open("y_train", "rb") as file:
    y_train = pickle.load(file)
#print('Y Training Set: ', y_train[0:5])
#print('Shape Y Training Set: ', y_train.shape)
#print('\n')

with open("y_test", "rb") as file:
    y_test = pickle.load(file)
#print('Y Testing Set: ', y_test[0:5])
#print('Shape Y Testing Set: ', y_test.shape)
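
To make the "byte stream" idea concrete, here is a minimal in-memory round trip on a toy DataFrame with made-up values. It uses `pickle.dumps` and `pickle.loads`, which work on bytes rather than files. This is an illustration only, not a replacement for the cells above.

```python
import pickle
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({"Make": ["Audi", "Subaru"], "Year": [2020, 2022]})

# Serialise to bytes, then rebuild the object from those bytes
payload = pickle.dumps(toy)
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored.equals(toy))
```

Only unpickle data that comes from a source you trust, because loading a pickle can execute arbitrary code.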