In [1]:
import pandas as pd
from scipy import stats
import my_statistics as mystats

In [2]:
df = pd.read_csv("capstone_data_clean.csv")
df["Adv_Year"] = df["Adv_Year"].astype("category")
df["Adv_Month"] = df["Adv_Month"].astype("category")
df["Gearbox_Type"] = df["Gearbox_Type"].astype("category")
df["Fuel_Type"] = df["Fuel_Type"].astype("category")

In [3]:
df.dtypes

Maker             object
Genmodel          object
Genmodel_ID       object
Adv_Year        category
Adv_Month       category
Color             object
Body_Type         object
Gearbox_Type    category
Fuel_Type       category
Reg_Year         float64
Mileage          float64
Engine_Size      float64
Price            float64
Engine_Power     float64
Annual_Tax       float64
Wheelbase        float64
Height           float64
Width            float64
Length           float64
Average_Mpg      float64
Top_Speed        float64
Seat_Num         float64
Door_Num         float64
dtype: object

In [4]:
df.head()

Unnamed: 0,Maker,Genmodel,Genmodel_ID,Adv_Year,Adv_Month,Color,Body_Type,Gearbox_Type,Fuel_Type,Reg_Year,...,Engine_Power,Annual_Tax,Wheelbase,Height,Width,Length,Average_Mpg,Top_Speed,Seat_Num,Door_Num
0,Bentley,Arnage,10_1,2018,Apr,Silver,Saloon,Automatic,Petrol,2000.0,...,,,3116.0,1515.0,2125.0,5390.0,,,5.0,4.0
1,Bentley,Arnage,10_1,2018,Jun,Grey,Saloon,Automatic,Petrol,2002.0,...,450.0,315.0,3116.0,1515.0,2125.0,5390.0,13.7,179.0,5.0,4.0
2,Bentley,Arnage,10_1,2017,Nov,Blue,Saloon,Automatic,Petrol,2002.0,...,400.0,315.0,3116.0,1515.0,2125.0,5390.0,14.7,155.0,5.0,4.0
3,Bentley,Arnage,10_1,2018,Apr,Green,Saloon,Automatic,Petrol,2003.0,...,,,3116.0,1515.0,2125.0,5390.0,,,5.0,4.0
4,Bentley,Arnage,10_1,2017,Nov,Grey,Saloon,Automatic,Petrol,2003.0,...,,,3116.0,1515.0,2125.0,5390.0,,,5.0,4.0


# Characteristics of the Price variable

In [5]:
df["Price"].describe()

count    2.662350e+05
mean     1.471670e+04
std      2.594508e+04
min      1.000000e+02
25%      4.990000e+03
50%      9.295000e+03
75%      1.711100e+04
max      2.599990e+06
Name: Price, dtype: float64

In [6]:
df["Price"].mode()

0    3995.0
Name: Price, dtype: float64

In [7]:
df["Price"].skew()

30.194572570312697

### Normality test for Price
> H0: the variable is normally distributed
>
> H1: the variable is not normally distributed

In [8]:
stats.normaltest(df["Price"], nan_policy="omit")

NormaltestResult(statistic=696368.0014545948, pvalue=0.0)

### Conclusion
1. The p-value is effectively 0, so H0 is rejected in favor of H1: the distribution of prices is not normal. The skewness of about 30.2 points the same way.
2. Because of that, I use non-parametric tests in the rest of the analysis.

## Numeric features' influence on Price

### Spearman correlations
I chose the Spearman rank correlation because:
1. It's a non-parametric, rank-based measure.
2. It makes no assumptions about the distribution of the data, except that the variables should be at least ordinal.

In [9]:
mystats.spearman_assess(df, "Price")    # It takes around 1 minute

Unnamed: 0,Feature,Spearman_R,P_Value,Strength
0,Reg_Year,0.736614,0.0,Strong
1,Mileage,-0.666629,0.0,Moderate
2,Engine_Power,0.582099,0.0,Moderate
3,Length,0.483369,0.0,Moderate
4,Top_Speed,0.459417,0.0,Moderate
5,Width,0.455132,0.0,Moderate
6,Wheelbase,0.442061,0.0,Moderate
7,Engine_Size,0.440565,0.0,Moderate
8,Height,0.156611,0.0,Weak
9,Seat_Num,0.038742,6.09802e-89,
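
The source of `mystats.spearman_assess` is not shown in this notebook. Below is a minimal sketch of how such a helper could work, assuming it loops over the numeric columns and calls `scipy.stats.spearmanr` on the rows where both values are present. The names `spearman_sketch` and `strength_label` are hypothetical, and the strength cut-offs (0.7, 0.4, 0.1) are assumptions chosen only to be consistent with the labels in the table above.

```python
import pandas as pd
from scipy import stats


def strength_label(r):
    """Map |r| to a verbal label. The cut-offs are assumed, not the author's."""
    a = abs(r)
    if a >= 0.7:
        return "Strong"
    if a >= 0.4:
        return "Moderate"
    if a >= 0.1:
        return "Weak"
    return ""


def spearman_sketch(df, target):
    """Spearman correlation of every numeric column with `target`."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        if col == target:
            continue
        # Keep only rows where both the feature and the target are present.
        pair = df[[col, target]].dropna()
        if len(pair) < 3 or pair[col].nunique() < 2:
            continue  # too little data, or a constant column
        r, p = stats.spearmanr(pair[col], pair[target])
        rows.append({"Feature": col, "Spearman_R": r,
                     "P_Value": p, "Strength": strength_label(r)})
    out = pd.DataFrame(rows, columns=["Feature", "Spearman_R", "P_Value", "Strength"])
    # Strongest relationship first, regardless of sign.
    return out.sort_values("Spearman_R", key=abs, ascending=False).reset_index(drop=True)
```

Calling `spearman_sketch(df, "Price")` would return a table with the same columns as the output above, although the exact values depend on how missing data is handled.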


## Categorical features' influence on Price

### Because Price is not normally distributed, I use the Kruskal-Wallis H-test for the categorical features.

My implementation of the test works on categories in a numerical format, so I created a function to convert the values.

In [10]:
mystats.kruskal_all(df, "Price")

Unnamed: 0,Feature,Kruskal-Wallis_H,P_Value
0,Maker,400353.805028,0.0
1,Genmodel,390654.420981,0.0
2,Genmodel_ID,390679.03042,0.0
3,Adv_Year,439784.583095,0.0
4,Adv_Month,401705.923938,0.0
5,Color,401135.020731,0.0
6,Body_Type,403927.393802,0.0
7,Gearbox_Type,416197.051149,0.0
8,Fuel_Type,412060.534283,0.0
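
For comparison, `scipy.stats.kruskal` accepts one sample of measurements per group, so one way to run the test is to split Price by category level and pass each group as a separate argument. The sketch below follows that approach. It is not the code in `my_statistics`, the name `kruskal_sketch` is hypothetical, and it may not reproduce the H values shown above.

```python
import pandas as pd
from scipy import stats


def kruskal_sketch(df, feature, target):
    """Kruskal-Wallis H-test of `target` across the levels of `feature`."""
    data = df[[feature, target]].dropna()
    # One array of target values per category level; empty levels are skipped.
    groups = [g[target].to_numpy()
              for _, g in data.groupby(feature, observed=True)]
    if len(groups) < 2:
        raise ValueError("Kruskal-Wallis needs at least two groups")
    h, p = stats.kruskal(*groups)
    return h, p


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol"] * 3 + ["Diesel"] * 3 + ["Hybrid"] * 3,
    "Price": [1, 2, 3, 10, 11, 12, 20, 21, 22],
})
h, p = kruskal_sketch(toy, "Fuel_Type", "Price")
print(round(h, 2), round(p, 4))
```

With this design the category labels never need to be converted to numbers, because only the grouping of the Price values matters.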


## Conclusions

According to the tests, every categorical feature has a statistically significant relationship with Price.
Among the numerical features, there is 1 strong correlation (Reg_Year) and 7 moderate ones, all statistically significant.

I need to select only 5 features for this project, so I decided to go with 2 categorical and 3 numerical features:
1. Maker
2. Fuel Type
3. Registration Year
4. Mileage
5. Engine Power

For the categorical variables the choice is rather arbitrary, but I selected the numeric variables based on the strength of their correlation with Price.
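
The numeric selection can be reproduced from the Spearman table with a few lines. This is a small sketch that re-enters the reported coefficients by hand and ranks them by absolute value; the variable names are illustrative only.

```python
import pandas as pd

# Spearman coefficients copied from the results table in this notebook.
spearman = pd.DataFrame({
    "Feature": ["Reg_Year", "Mileage", "Engine_Power", "Length", "Top_Speed",
                "Width", "Wheelbase", "Engine_Size", "Height", "Seat_Num"],
    "Spearman_R": [0.736614, -0.666629, 0.582099, 0.483369, 0.459417,
                   0.455132, 0.442061, 0.440565, 0.156611, 0.038742],
})

# Rank by absolute correlation so that strong negative relationships
# (such as Mileage) are not pushed to the bottom.
top3 = (spearman.assign(abs_r=spearman["Spearman_R"].abs())
        .nlargest(3, "abs_r")["Feature"]
        .tolist())
print(top3)  # ['Reg_Year', 'Mileage', 'Engine_Power']
```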