<a href="https://colab.research.google.com/github/irfixq/Avocado_King/blob/main/Avocado_King.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Avocado King - Avocado Price & Sales Prediction

## System Configuration

In [None]:
import sys #access to system parameters https://docs.python.org/3/library/sys.html

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib # collection of functions for scientific and publication-ready visualization
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns # statistical data visualization
from scipy import stats # statistical functions (probability plot, linear regression)

import warnings # ignore warnings
warnings.filterwarnings('ignore')

from google.colab import data_table # to show full data table in multiple pages
%load_ext google.colab.data_table
pd.set_option('display.max_rows', 30000)  # full option name; the short form 'max_rows' can be ambiguous in newer pandas

In [None]:
## Check system and Python dependency versions
print("Python version: {}".format(sys.version))
print("NumPy version: {}".format(np.__version__))
print("pandas version: {}".format(pd.__version__))
print("matplotlib version: {}".format(matplotlib.__version__))


There are 2 options for getting the data: GitHub or Google Drive.
In this case, I prefer `git clone`, since it is easier for users to access the repo than to load everything into their Google Drive or local machine.

In [None]:
## Clone repo from GitHub
! git clone 'https://github.com/irfixq/Avocado_King'

In [None]:
## Mount Google Drive to get data
## make sure you uploaded the folder into your Google Drive first

#from google.colab import drive 
#drive.mount('/content/drive')

In [None]:
## get working directory
! pwd

## list all folders in working directory
! ls

In [None]:
## change working directory to github folder
import os
os.chdir('/content/Avocado_King')


In [None]:
## check working directory after change path
! pwd
! ls

## Data Pre-Processing

1. Data pre-processing for price and sales data
> * Read dataset  as pandas dataframe
> * Check for df dimension (rows, columns)
> * Check for column names and datatype
> * Show raw dataset table
> * Check for missing values
> * Handle missing values (if any)


2. Data Distribution
3. Data pre-processing for Google search data



#### Data pre-processing for price and sales data

In [None]:
## see the shape of the dataset (rows, columns)
df_price = pd.read_csv('/content/Avocado_King/price-and-sales-data.csv')
df_price.shape

In [None]:
## list all column names
df_price.columns

In [None]:
## checking data type of each column
df_price.dtypes

In [None]:
## see the dataset
df_price.head()

In [None]:
## check for missing values in dataset
print(f"Missing data:{df_price.isna().sum(axis=0).any()}") # True means the dataset has missing values

In [None]:
## see distribution of missing values in heat map
sns.heatmap(df_price.isna(),cmap='Greens')

Based on the heatmap above, the dark marks represent missing values in our dataset. The columns 'Date', 'type', 'year' and 'region' do not have any missing values.
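The heatmap shows where values are missing; a per-column count makes the same information exact. The following is a minimal sketch on a small made-up frame. The `toy` data and the `missing_summary` helper are hypothetical, and the column names are borrowed from this dataset only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # keep only columns with at least one missing value, largest count first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy data standing in for df_price
toy = pd.DataFrame({
    "AveragePrice": [1.2, np.nan, 1.5, np.nan],
    "TotalVolume": [100.0, 200.0, np.nan, 400.0],
    "type": ["organic", "conventional", "organic", "conventional"],
})
print(missing_summary(toy))
```

In the notebook, the same helper could be called as `missing_summary(df_price)`.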

In [None]:
## See the missing data in dataset
df_price_NA_check = df_price.isna()
df_price_NA_check.head(10)

In [None]:
## Save as new .csv table to see whole data / for download
df_price_NA_check.to_csv('df_price_NA_check.csv',sep=',')

In [None]:
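## there are 2 options to handle missing data
# option 1 = eliminate data points that contain missing values (not recommended, as you might lose important data for other attributes)
# option 2 = substitute each missing value with the average value of its attribute

# numeric_only=True restricts the means to numeric columns; non-numeric columns (e.g. Date, type, region) are left unchanged
dfnew_price = df_price.fillna(df_price.mean(numeric_only=True))
dfnew_price.head(10)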

In [None]:
## check the dataset after imputing missing values
print(f"Missing data:{dfnew_price.isna().sum(axis=0).any()}")  # False means there are no missing values left in the dataset

In [None]:
## save the dataset after substituting missing values / for download
dfnew_price.to_csv('dfnew_price.csv',sep=',')

In [None]:
## Check for outliers
outliers = dfnew_price.describe()
outliers

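Based on the table above, no value looks obviously implausible. Note, however, that the mean lying between the min and max values holds for every dataset, so it does not by itself rule out outliers; comparing values against the interquartile range is a more reliable check.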
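One common way to test for outliers more directly is the 1.5 x IQR rule, which flags values far outside the middle 50% of the data. The sketch below is an illustration, not part of the original analysis: `iqr_outlier_counts` and the `toy` frame are hypothetical, and the multiplier `k=1.5` is the conventional default rather than a value taken from this notebook.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # comparisons align the per-column bounds with the matching columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy data: 9.0 is an outlier even though the mean lies between min and max
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.1, 1.2, 1.3, 1.4, 9.0],
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
})
print(iqr_outlier_counts(toy))
```

In the notebook, `iqr_outlier_counts(dfnew_price)` would give one count per numeric column.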


In [None]:
## Save as new .csv file / for download
outliers.to_csv('outliers.csv',sep=',')

In [None]:
## Check for duplicate values in dataset
print('Duplicated values = ',sum(dfnew_price.duplicated()))

In [None]:
## Checking each features of the cleaned dataset
dfnew_price.info()

In [None]:
## see distribution of cleaned dataset in heat map
sns.heatmap(dfnew_price.isna(),cmap='Greens')

**Conclusion after data pre-processing**
* Features = 13
* Instances = 25,161
* No duplicate values
* No null values after imputing the missing values with mean of the attribute itself
* Features with datatype 'object' that could serve as class labels (called classifiers in this notebook) are 'type' and 'region'

#### Data Distribution
This section examines how the variables are distributed.

##### Visualizing Data Distribution

In [None]:
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(12,10))
# Univariate distribution plot
# kde=True overlays a kernel density estimate, i.e. a continuous density obtained by smoothing the observations with a Gaussian kernel
sns.histplot(dfnew_price['AveragePrice'], kde=True, stat='density', color='green', ax=ax[0])  # replaces the deprecated sns.distplot
# Box plot
sns.boxplot(x=dfnew_price['AveragePrice'], color='green', ax=ax[1])

## see probability distribution of avg price
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
# Probability plot: sample quantiles against the quantiles of a normal distribution
stats.probplot(dfnew_price['AveragePrice'], plot=ax)
plt.show()

**Conclusion from visualizing the data distribution**
* All 3 plots above suggest a bimodal distribution, which tells us that the distribution has 2 local maxima.
* As discussed in the pre-processing section, our potential classifiers are 'type' and 'region'; 'type' has 2 classes (organic and conventional), which could explain the two modes.

In [None]:
# Bivariate distribution plot of average price against total volume for each avocado type (class: Organic & Conventional)
sns.displot(dfnew_price, x='TotalVolume', y='AveragePrice',hue='type',height=10)

In [None]:
# Bivariate distribution plot of average price against total volume for all regions
sns.displot(dfnew_price, x='TotalVolume', y='AveragePrice',hue='region',height=10)

**Conclusion from Bivariate Distribution plot**
* Based on the bivariate distribution plot above, conventional avocados were sold in larger volumes than organic avocados.
* Organic avocados also sold at a higher price than conventional avocados.
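The visual impression from the plot can be backed by numbers with a grouped aggregation. The sketch below uses a hypothetical `toy` frame with made-up values; only the column names (`type`, `AveragePrice`, `TotalVolume`) come from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for dfnew_price
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "AveragePrice": [1.0, 1.2, 1.6, 1.8],
    "TotalVolume": [900.0, 1100.0, 90.0, 110.0],
})

# mean price and total volume sold for each avocado type
by_type = toy.groupby("type").agg(
    mean_price=("AveragePrice", "mean"),
    total_volume=("TotalVolume", "sum"),
)
print(by_type)
```

Replacing `toy` with `dfnew_price` would produce the corresponding table for the real data.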

**Skewness**
* Measure of the asymmetry of the probability distribution of a random variable about its mean. In other words, skewness tells you the amount and direction of skew (departure from horizontal symmetry).

In [None]:
print("Skewness: %f" % dfnew_price['AveragePrice'].skew())

* If skewness is 0, the data are perfectly symmetrical, although it is quite unlikely for real-world data.
* If skewness is less than -1 or greater than 1, the distribution is highly skewed.
* **If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.**
* If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
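The rule of thumb in the list above can be written as a small helper. This is a sketch: `describe_skewness` is a hypothetical name, and the handling of the exact boundary values (0.5 and 1) is an assumption, since the list does not say which category they belong to.

```python
def describe_skewness(skew):
    """Map a skewness value to the rule-of-thumb labels listed above."""
    magnitude = abs(skew)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:  # assumption: the boundaries 0.5 and 1 count as moderate
        return "moderately skewed"
    return "approximately symmetric"

for value in (0.0, 0.7, -1.3):
    print(value, "->", describe_skewness(value))
```

In the notebook it could be called as `describe_skewness(dfnew_price['AveragePrice'].skew())`.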

**Kurtosis**
* Measures the heaviness of the distribution's tails relative to a normal distribution.

In [None]:
## Kurtosis: Measure heaviness of the distribution tails
print("Kurtosis: %f" % dfnew_price['AveragePrice'].kurt())

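* A positive value means heavier tails (more data in the tails) than a normal distribution; a negative value means lighter tails.
* pandas `kurt()` already returns excess kurtosis (Fisher's definition, where a normal distribution scores 0), so the printed value can be read directly. Subtracting 3 from it again, which is how the earlier figure of -2.442 (lighter tails than a normal distribution) was obtained, applies the correction twice.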
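A quick way to check which kurtosis convention pandas uses is to apply `kurt()` to samples with known tail behaviour. The sketch below uses synthetic samples (the names `normal_sample` and `uniform_sample` are hypothetical): a normal sample should score close to 0 under the excess (Fisher) definition, and a uniform sample, which has lighter tails, should score below 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=200_000))
uniform_sample = pd.Series(rng.uniform(size=200_000))

# pandas reports excess kurtosis, so normally distributed data scores close to 0
print("normal :", round(normal_sample.kurt(), 2))
# a uniform distribution has lighter tails than a normal one, so the value is negative
print("uniform:", round(uniform_sample.kurt(), 2))
```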

##### Classifiers

We have identified potential classifiers to be 'type' and 'region'

In [None]:
df_conv_avo = dfnew_price[dfnew_price['type'] == 'conventional']
print("Conventional Avocado = ",df_conv_avo.shape)

df_org_avo = dfnew_price[dfnew_price['type'] == 'organic']
print("Organic Avocado = ",df_org_avo.shape)

In [None]:
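## Create histogram to see data distribution of both classes in 'type'

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df_conv_avo['AveragePrice'], kde=True, stat='density', color='brown', label='conventional', ax=ax) # conventional avocado
sns.histplot(df_org_avo['AveragePrice'], kde=True, stat='density', color='darkgreen', label='organic', ax=ax) # organic avocado
ax.legend()
plt.show()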

In [None]:
## Calculate 'Measure of Spread' for AveragePrice of the CONVENTIONAL avocado dataset
df_conv_avo['AveragePrice'].describe()

In [None]:
## Calculate 'Measure of Spread' for AveragePrice of the ORGANIC avocado dataset
df_org_avo['AveragePrice'].describe()

Visualize the 'Measure of Spread' calculated for AveragePrice of both classes using a box plot.
https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/box-plot/

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,8))
sns.boxplot(x='type', y='AveragePrice', data=dfnew_price, palette='Greens', showmeans=True, ax=ax)
plt.show()

# The two classes can also be drawn on one shared axis (the boxes overlap)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,8))
# pass the values by keyword so they are drawn along the x axis in every seaborn version
sns.boxplot(x=df_conv_avo['AveragePrice'], color='brown', ax=ax)
sns.boxplot(x=df_org_avo['AveragePrice'], color='darkgreen', ax=ax)
plt.show()

In [None]:
conventional = dfnew_price['type']=='conventional'
organic = dfnew_price['type']=='organic'

In [None]:
## Average price of CONVENTIONAL avocado in each year by region
# sns.catplot(kind='point') replaces the deprecated sns.factorplot; 'height' replaces 'size'
conv_price_byyear_byregion = sns.catplot(x='AveragePrice', y='region', data=dfnew_price[conventional], hue='year', kind='point', height=10, palette='bright', join=False)

In [None]:
## Average price of ORGANIC avocado in each year by region
organic_price_byyear_byregion = sns.catplot(x='AveragePrice', y='region', data=dfnew_price[organic], hue='year', kind='point', height=10, palette='bright', join=False)

In [None]:
## Average price of conventional avocado by region (average from year 2015-2019)
conventional_factorplot = sns.catplot(x='AveragePrice', y='region', data=dfnew_price[conventional], kind='point', color='brown', height=10, join=False)

In [None]:
## Average price of organic avocado by region (average from year 2015-2019)
organic_factorplot = sns.catplot(x='AveragePrice', y='region', data=dfnew_price[organic], kind='point', color='darkgreen', height=10, join=False)

**Conclusion**
* Not only 'type' but also 'region' affects the average price.
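Since 'type' and 'region' appear to affect the price, one option for making them usable by a numeric model is one-hot encoding. The sketch below is an illustration only and is not a step the notebook performs: the `toy` frame, its values and the region names are made up, and `drop_first=True` is a design choice to avoid redundant columns.

```python
import pandas as pd

# Hypothetical toy data standing in for dfnew_price
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.7, 1.1, 1.9],
    "TotalVolume": [900.0, 90.0, 1100.0, 110.0],
    "type": ["conventional", "organic", "conventional", "organic"],
    "region": ["Albany", "Albany", "Boston", "Boston"],
})

# One-hot encode the categorical columns; drop_first removes one redundant column per feature
encoded = pd.get_dummies(toy, columns=["type", "region"], drop_first=True)
print(encoded.columns.tolist())
```

The encoded columns are numeric indicators, so they could be kept in a feature matrix instead of being dropped.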


## Feature Extraction

In [None]:
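# correlation matrix of the numeric features; non-numeric columns (Date, type, region) are excluded
corrmat = dfnew_price.select_dtypes(include='number').corr()
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(14, 10))
ax.set_title("Correlation Matrix", fontsize=20)
sns.heatmap(corrmat, vmin=-1, vmax=1, cmap='RdYlGn', annot=True, ax=ax)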

**Conclusion from feature extraction**
* TotalVolume and TotalBags show the strongest correlation.
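Reading the strongest pair off an annotated heatmap can be error-prone, so it may help to rank the pairs programmatically. The sketch below is a hypothetical helper (`strongest_pairs`) applied to synthetic data in which `TotalBags` is constructed to track `TotalVolume`; the values are made up.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, top=3):
    """Return the largest absolute pairwise correlations among numeric columns."""
    corr = df.select_dtypes(include="number").corr().abs()
    # keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)

# Synthetic data: TotalBags is built from TotalVolume plus a little noise
rng = np.random.default_rng(0)
volume = rng.uniform(100, 1000, size=200)
toy = pd.DataFrame({
    "TotalVolume": volume,
    "TotalBags": volume * 0.4 + rng.normal(0, 5, size=200),
    "AveragePrice": rng.uniform(1, 2, size=200),
})
top_pair = strongest_pairs(toy, top=1)
print(top_pair)
```

In the notebook, `strongest_pairs(dfnew_price)` would list the top pairs for the real features.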

## Training

In [None]:
from sklearn.model_selection import train_test_split

# drop the target and the non-numeric columns (Date, type, region), which LinearRegression cannot use directly
X = dfnew_price.drop(columns=['AveragePrice','Date','type','region'])
y = dfnew_price['AveragePrice']  # dependent variable
print('Shape of dataset = ', X.shape, y.shape)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('Shape of training dataset = ', X_train.shape, y_train.shape)
print('Shape of test dataset = ', X_test.shape, y_test.shape)

In [None]:
X.head(10)

## Prediction Machine Learning Model

In [None]:
from sklearn import linear_model
lm = linear_model.LinearRegression()

In [None]:
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

predictions

In [None]:
## Plotting model

plt.scatter(y_test, predictions)
plt.xlabel('AveragePrice')
plt.ylabel('Predictions')

In [None]:
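## Check for accuracy
# for a regressor, score() returns the coefficient of determination (R^2), not a classification accuracy
print ('R^2 score of Linear Regression model = ', model.score(X_test, y_test))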
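For a scikit-learn regressor, `score()` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand with NumPy, together with the root mean squared error, to show what the number means; `r_squared` and `rmse` are hypothetical helper names and the inputs are made-up values.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # always predicting the mean
print(rmse([1, 2, 3], [2, 3, 4]))
```

In the notebook these could be called as `r_squared(y_test, predictions)` and `rmse(y_test, predictions)`; the latter is expressed in price units, which is often easier to interpret than R^2.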

**Another model**

https://www.kaggle.com/mruanova/predict-avocado-prices-using-linear-regression/notebook#Step-6-Missing-Data

In [None]:
from scipy import stats

X = dfnew_price.year
y = dfnew_price['AveragePrice']

slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy

def modelPrediction(x):
  return slope * x + intercept

model = list(map(modelPrediction, X)) # scipy

In [None]:
X_pred = 2020
y_pred = modelPrediction(X_pred)
# the regression above is fitted on dfnew_price, i.e. on both avocado types
print('Model Prediction of avocado AveragePrice in 2020 (all types)')
avocado_price = round(y_pred, 2)
print('$ {} USD'.format(avocado_price))

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,8))

# label each artist explicitly so the legend entries cannot be swapped
plt.scatter(X, y, color='green', label='Avocado Prices (2015-2018)') # Scatter Plot
plt.plot(X, model, color='red', label='Model Prediction using Linear Regression')
plt.ylim(bottom=0) # starts at zero
plt.xticks(np.arange(min(X), max(X)+1))
plt.legend()
plt.show()

In [None]:
model = lm.fit(X_train, y_train)

In [None]:
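## Check for accuracy
# model.score() expects a 2-D feature array and a vector of targets, so it cannot score the single 2020 prediction;
# report the R^2 of the year-only regression from stats.linregress instead
print ('R^2 of the year-only Linear Regression model = ', r**2)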

## Validation
Our dataset is considered small, and no single split gives an estimate with satisfactorily low variance. Hence, cross-validation of the data is proposed.

In [None]:
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train, test_size=0.20)
print('Shape of training dataset = ', X_train.shape, y_train.shape)
print('Shape of validation dataset = ', X_val.shape, y_val.shape)

In [None]:
kf = KFold(n_splits=10) # Define the split - into 10 folds
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
print(kf) 
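
A `KFold` object only defines the folds; it takes effect once it is passed to a scoring routine. The sketch below shows one way to do that with `cross_val_score` on synthetic data. The names `X_demo`, `y_demo` and `kf_demo` are hypothetical, and `shuffle=True` with a fixed `random_state` is an assumption made for reproducibility, not a setting taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the avocado features and prices
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 10 folds: each fold is held out once while the model is fitted on the other 9
kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("mean R^2 = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```

Reporting the mean and spread across folds gives a more stable estimate than the score from any single split.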

In [None]:
model = lm.fit(X_train, y_train)
predictions_val = lm.predict(X_val)

predictions_val

In [None]:
## Plotting model

plt.scatter(y_val, predictions_val)
plt.xlabel('AveragePrice')
plt.ylabel('Predictions')

In [None]:
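## Check for accuracy
# for a regressor, score() returns the coefficient of determination (R^2), not a classification accuracy
print ('R^2 score of Linear Regression model = ', model.score(X_val, y_val))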

**Conclusion**
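* The model score on the validation split is higher than the score on the earlier test split. Both splits are random and the `KFold` object defined above is not passed to the scoring, so this compares two single hold-out splits rather than a full cross-validated estimate.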