# <span style='color:darkgreen'> Python for Data Analysis</span>
In this notebook, we will look into how to analyze data in Python using different libraries, including `pandas`, `numpy`, `matplotlib`, and `seaborn`.

We will look into different steps in analysing the data,
- **STEP 1**: starting from **data acquisition or import**,
- **STEP 2**: then doing **pre-processing** of the data,
- **STEP 3**: then doing **exploratory data analysis** to understand more about the data and different ways to **visualize** the outcome.
- **STEP 4**: Lastly, let's build a simple **ML model** for **predicting** the car price.

<hr>

## **<span style='color:darkred'>Machine Learning</span>**
In Python, machine learning can be implemented using the `scikit-learn` library. https://scikit-learn.org/stable/

<img src='https://scikit-learn.org/1.3/_static/ml_map.png' width=800px>


## **<span style='color:darkred'>STEP 4. Predicting the Car Price</span>**
Now, let's build a machine learning model using the cleaned data to predict the car price based on the given features of the car.

In general, building a machine learning model involves several steps:
1. Data preparation: even though our data is clean, we still have to prepare it in a form that the algorithm can read.
2. Split the data into training and testing data.
3. Train the model using the training data.
4. Get the performance of the trained model using the testing data.

Let's reload the data that was cleaned in the previous section.


In [1]:
import pandas as pd

automobile_cleaned = pd.read_csv('automobile_cleaned.csv')

automobile_cleaned.head()

| Unnamed: 0 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | price | city-L/100km | highway-L/100km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 13495.0 | 11.190476 | 8.703704 |
| 1 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 16500.0 | 11.190476 | 8.703704 |
| 2 | 1 | 122.0 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 16500.0 | 12.368421 | 9.038462 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102.0 | 5500.0 | 13950.0 | 9.791667 | 7.833333 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115.0 | 5500.0 | 17450.0 | 13.055556 | 10.681818 |


### **<em style='color:darkred'>4.1 Data Preparation</em>**

#### **<em style='color:darkred'>4.1.1 Data Encoding</em>**
First is **data encoding**. Many categorical columns appear as text, and these columns must be transformed into numerical values. One way to do this is data encoding, which maps each text value to a numerical value.

##### **<span style='color:darkred'>One-Hot Encoding</span>**
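**One-Hot Encoding** is a technique used in machine learning and data preprocessing to convert categorical variables into a numerical format that can be fed into algorithms. It's particularly useful for nominal features, whose categories have no natural order. Note that it adds one column per category, so features with many unique categories make the dataset much wider.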
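
As a small illustration of the idea, here is a sketch on a toy table (the rows are made up for the example and are not taken from the automobile data). Each category becomes its own 0/1 indicator column.

```python
import pandas as pd

# Toy data: hypothetical rows, used only to show the shape of the output.
toy = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "num-of-doors": ["two", "four", "four"],
})

# One indicator column per category. When `columns` is given, pandas
# prefixes each new column with the name of the column it came from.
encoded = pd.get_dummies(toy, columns=["fuel-type", "num-of-doors"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one `1` among the indicator columns derived from the same original column.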

In [None]:
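def get_one_hot(data, col):
    # Prefix each indicator column with the source column name so that
    # identical category labels from different columns cannot produce
    # duplicate column names.
    return pd.get_dummies(data[col], prefix=col, dtype=int)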

In [None]:
categorical_columns = automobile_cleaned.dtypes[(automobile_cleaned.dtypes=='object')]
categorical_columns = categorical_columns.index.values
print(categorical_columns)

In [None]:
automobile_prep = automobile_cleaned.copy()

for col in categorical_columns:
    one_hot = get_one_hot(automobile_prep, col)
    automobile_prep = pd.concat([automobile_prep, one_hot], axis=1)
    automobile_prep.drop(col, axis=1, inplace=True)

In [None]:
automobile_features = automobile_prep.drop('price', axis=1)
automobile_features

In [None]:
automobile_target = automobile_prep['price']
automobile_target

##### **<span style='color:darkred'>Label Encoding</span>**
**Label Encoding** is an alternative that replaces each category with an integer code, so every categorical feature stays in a single column.

In [None]:
from sklearn.preprocessing import LabelEncoder

automobile_prep_le = automobile_cleaned.copy()
automobile_prep_le.drop('price', axis=1, inplace=True)

encoder = LabelEncoder()

for col in categorical_columns:
    automobile_prep_le[col] = encoder.fit_transform(automobile_prep_le[col])

In [None]:
automobile_prep_le
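
To see what `LabelEncoder` does to a single column, here is a minimal sketch on a hypothetical list of body styles. The encoder sorts the distinct labels and assigns each one an integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, chosen only for illustration.
body_styles = ["sedan", "hatchback", "sedan", "wagon"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(body_styles)

# classes_ holds the sorted distinct labels; a label's position is its code.
print(list(demo_encoder.classes_))
print(codes.tolist())

# The mapping can be reversed to recover the original text.
print(list(demo_encoder.inverse_transform(codes)))
```

The integer codes imply an ordering (here `hatchback < sedan < wagon`) that has no real meaning, which some models may pick up on. `scikit-learn` documents `LabelEncoder` as intended for target labels and provides `OrdinalEncoder` for feature columns.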

#### **<em style='color:darkred'>4.1.2 Data Normalization</em>**
Now all the values in our dataset are numerical. However, the features vary widely in scale, and that could bias the model toward features with larger values. Thus, we will take the features (all columns except the target column `price`) and normalize them using a standard scaler from the `scikit-learn` library.
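
Standard scaling transforms each column to $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are that column's mean and standard deviation. The sketch below checks this on a few `wheel-base` and `horsepower` values from the preview above; the array name `sample` is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales: wheel-base and horsepower.
sample = np.array([[88.6, 111.0],
                   [94.5, 154.0],
                   [99.8, 102.0],
                   [99.4, 115.0]])

sample_scaled = StandardScaler().fit_transform(sample)

# The same result computed by hand (population standard deviation, ddof=0).
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(sample_scaled, manual))
print(sample_scaled.mean(axis=0).round(6))
print(sample_scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so no feature dominates purely because of its units.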

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
automobile_features_scaled = scaler.fit_transform(automobile_features)

In [None]:
automobile_features_scaled

In [None]:
scaler = StandardScaler()
automobile_features_le_scaled = scaler.fit_transform(automobile_prep_le)

In [None]:
automobile_features_le_scaled

### **<em style='color:darkred'>4.2 Data Splitting</em>**
Before starting the model training, we need to split our data into a training set and a testing set. The training set is the data used by the algorithm to search for patterns that can be used to predict the car price. The testing set is held back to evaluate the trained model.

**Steps:**
1. Load the `model_selection` module from the `scikit-learn` library.
2. Split the data into train and test sets using the `train_test_split()` function. The input parameters for the function are as follows:
   
   `train_test_split( features, target, test_size=0.2, random_state=1 )`
   
   where `test_size` is the fraction of the data assigned to the testing set (0.2 --> 20% test, 80% train) and `random_state` fixes the shuffling so that the split is reproducible. The cell below uses `test_size=0.3`.
3. This function returns 4 variables: train data, test data, train target, and test target, respectively.
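
The steps above can be sketched on a tiny synthetic array to make the return order and the resulting shapes concrete. The names `toy_features` and `toy_target` are placeholders, not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 2 features each; row i is [2*i, 2*i + 1] and its target is i.
toy_features = np.arange(20).reshape(10, 2)
toy_target = np.arange(10)

# Return order: train features, test features, train target, test target.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=1
)

print(toy_X_train.shape, toy_X_test.shape)
print(toy_y_train.shape, toy_y_test.shape)

# Rows are shuffled, but each feature row stays paired with its own target.
print((toy_X_train[:, 0] == 2 * toy_y_train).all())
```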

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(automobile_features_le_scaled, automobile_target, test_size=0.3, random_state=1)

In [None]:
X_train.shape

In [None]:
X_test.shape
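
One variation worth considering: in section 4.1.2 the scaler was fitted on the full dataset before the split, so statistics from the test rows influence the scaling of the training rows. A common alternative is to split first and fit the scaler on the training portion only. The sketch below shows that order on synthetic data; all names and values in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for unscaled features and a target.
rng = np.random.default_rng(1)
raw_features = rng.normal(loc=100.0, scale=15.0, size=(50, 3))
raw_target = rng.normal(size=50)

# 1. Split first.
raw_train, raw_test, target_train, target_test = train_test_split(
    raw_features, raw_target, test_size=0.3, random_state=1
)

# 2. Learn the mean and standard deviation from the training rows only.
split_scaler = StandardScaler().fit(raw_train)

# 3. Apply the same transformation to both parts.
scaled_train = split_scaler.transform(raw_train)
scaled_test = split_scaler.transform(raw_test)

print(scaled_train.shape, scaled_test.shape)
print(scaled_train.mean(axis=0).round(6))
```

The training columns end up with mean 0 exactly, while the test columns are only approximately centred, because they are scaled with the training statistics.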

### **<em style='color:darkred'>4.3 Model Training</em>**

Now, using the same dataset, let's train regression models using three different methods:
- Decision tree
- Random forest
- K-nearest neighbours

#### **<em style='color:darkred'>4.3.1 Decision Tree</em>**
To train a Decision Tree regression model in Python, first import `DecisionTreeRegressor` from the `tree` module of the `sklearn` library.

In [None]:
from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, Y_train)

#### **<em style='color:darkred'>4.3.2 Random Forest</em>**
To train a Random Forest regression model in Python, first import `RandomForestRegressor` from the `ensemble` module of the `sklearn` library.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(X_train, Y_train)

#### **<em style='color:darkred'>4.3.3 K-Nearest Neighbour</em>**
To train a K-Nearest Neighbour regression model in Python, first import `KNeighborsRegressor` from the `neighbors` module of the `sklearn` library.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor()
knn_model.fit(X_train, Y_train)
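
All three regressors share the same `fit()` / `predict()` interface, so they can also be trained in a loop. The following is a minimal sketch on synthetic data from `make_regression`, not the automobile data. The settings `random_state=1`, `n_estimators=50`, and `n_neighbors=5` are choices made here for a reproducible example and are not values from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: 100 samples, 5 features.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=1)

demo_models = {
    "Decision Tree": DecisionTreeRegressor(random_state=1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=1),
    "K-Nearest Neighbour": KNeighborsRegressor(n_neighbors=5),
}

# fit() returns the fitted estimator itself, so it can be stored directly.
fitted = {name: model.fit(X_demo, y_demo) for name, model in demo_models.items()}

# Every fitted model predicts one value per input row.
demo_predictions = {name: model.predict(X_demo[:3]) for name, model in fitted.items()}
for name, values in demo_predictions.items():
    print(name, values.round(2))
```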

### **<em style='color:darkred'>4.4 Model Evaluation</em>**
To test a trained model in Python, we call its `predict()` function and pass in the testing data (just the features, without the target). This returns an array of predicted target values (the price) for the testing data.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import math

def get_evaluation(model, name, Xtest, Ytest):
    # Predict prices for the test features and measure the error
    prediction = model.predict(Xtest)
    error = mean_squared_error(Ytest, prediction)
    print("Mean Squared Error: ", error)
    print("Root Mean Squared Error: ", math.sqrt(error))

    # Plot the true and predicted price for each test sample
    fig = plt.figure(figsize=(10,6))
    ax  = fig.add_subplot(111)
    samples = range(len(Ytest))  # works for any test-set size
    ax.plot(samples, Ytest, label='true')
    ax.plot(samples, prediction, label='predicted')

    ax.legend()
    ax.set_xlabel('test samples')
    ax.set_ylabel('price')
    ax.set_title("Comparison of Predicted vs Truth for "+name+" model")

    return error
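
The helper above reports two numbers. The mean squared error (MSE) is the average of the squared differences between true and predicted prices, and the root mean squared error (RMSE) is its square root, which is in the same units as the price. A minimal sketch with hypothetical predictions shows that `mean_squared_error` matches the hand computation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# True prices from the preview above; the predictions are made up.
true_prices = np.array([13495.0, 16500.0, 13950.0])
predicted_prices = np.array([14000.0, 16000.0, 15000.0])

mse = mean_squared_error(true_prices, predicted_prices)
mse_by_hand = np.mean((true_prices - predicted_prices) ** 2)
rmse = np.sqrt(mse)

print(np.isclose(mse, mse_by_hand))
print(mse, rmse)
```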

In [None]:
dt_error = get_evaluation(dt_model,"Decision Tree", X_test, Y_test)

In [None]:
rf_error = get_evaluation(rf_model,"Random Forest", X_test, Y_test)

In [None]:
knn_error = get_evaluation(knn_model,"K-Nearest Neighbour", X_test, Y_test)
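
To compare the models side by side, the returned errors can be collected into one table. `summarize_errors` below is a hypothetical helper, not part of the notebook, and the demo values are made up to show the output format. They are not results.

```python
import math
import pandas as pd

def summarize_errors(errors):
    """Return a table of MSE and RMSE per model, lowest RMSE first.

    `errors` maps a model name to its mean squared error.
    """
    summary = pd.DataFrame({"MSE": pd.Series(errors, dtype=float)})
    summary["RMSE"] = summary["MSE"].apply(math.sqrt)
    return summary.sort_values("RMSE")

# Made-up MSE values purely to demonstrate the helper.
demo_summary = summarize_errors({"model A": 9.0e6, "model B": 4.0e6, "model C": 1.6e7})
print(demo_summary)
```

In this notebook, the call would be `summarize_errors({"Decision Tree": dt_error, "Random Forest": rf_error, "K-Nearest Neighbour": knn_error})`.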