# Linear Regression Model that Accurately Predicts the Price of Apples

<img src="predict.jpeg" height=400>

# Table Of Contents
### 1. [Introduction](#introduction)

    1. Objective
    2. Parameters
    3. Outline
    
### 2. [Importing Data and Plotting](#import)

    1. Import necessary packages
    2. Import the data into a Pandas Dataframe
    3. Show the data
    4. Make an initial plot of the data

### 3. [Exploratory Data Analysis](#explore)
### 4. [Split Data: Testing and training](#split)
### 5. [Outliers](#outliers)
### 6. [Regression Model](#regress)

    1. Taking estimates
    2. Least squares
    3. sklearn

### 7. [Conclusion](#conclude)

## 1. Introduction <a name="introduction"></a>

### 1.1. Objective 
In this notebook we will design a regression model that will predict the cost of apples based on given parameters.

### 1.2. Parameters
1. Month/Season
2. Distance travelled
3. Supplier cost
4. Grade of apple
5. Demand and Supply
6. Container used?


## 2. Importing Data and Plotting <a name="import"></a>

#### 2.1. Import necessary packages

In [89]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm  # the .api namespace exposes OLS and the diagnostic tools
import sklearn as skl

#### 2.2. Import the data into a Pandas Dataframe

In [90]:
#read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed.
sample_submission = pd.read_csv("sample_submission.csv")
test_set = pd.read_csv("df-test_set.csv")
train_set = pd.read_csv("df-train_set.csv")

#### 2.3. Show the data 

In [91]:
#Viewing the first five rows of our train_set dataframe.
train_set.head() 

| | Province | Container | Size_Grade | Weight_Kg | Commodities | Date | Low_Price | High_Price | Sales_Total | Total_Qty_Sold | Total_Kg_Sold | Stock_On_Hand | avg_price_per_kg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CAPE | EC120 | 1L | 12.0 | APPLE GRANNY SMITH | 2020-03-10 | 108.0 | 112.0 | 3236.0 | 29 | 348.0 | 0 | 9.3 |
| 1 | CAPE | M4183 | 1L | 18.3 | APPLE GOLDEN DELICIOUS | 2020-09-09 | 150.0 | 170.0 | 51710.0 | 332 | 6075.6 | 822 | 8.51 |
| 2 | GAUTENG | AT200 | 1L | 20.0 | AVOCADO PINKERTON | 2020-05-05 | 70.0 | 80.0 | 4860.0 | 66 | 1320.0 | 50 | 3.68 |
| 3 | TRANSVAAL | BJ090 | 1L | 9.0 | TOMATOES-LONG LIFE | 2020-01-20 | 60.0 | 60.0 | 600.0 | 10 | 90.0 | 0 | 6.67 |
| 4 | WESTERN FREESTATE | PP100 | 1R | 10.0 | POTATO SIFRA (WASHED) | 2020-07-14 | 40.0 | 45.0 | 41530.0 | 927 | 9270.0 | 393 | 4.48 |


In [92]:
#The dataframe has 64376 rows and 13 columns.
train_set.shape

(64376, 13)

In [93]:
#The info method summarises the data, i.e. the data type and non-null count of each column.
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64376 entries, 0 to 64375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Province          64376 non-null  object 
 1   Container         64376 non-null  object 
 2   Size_Grade        64376 non-null  object 
 3   Weight_Kg         64376 non-null  float64
 4   Commodities       64376 non-null  object 
 5   Date              64376 non-null  object 
 6   Low_Price         64376 non-null  float64
 7   High_Price        64376 non-null  float64
 8   Sales_Total       64376 non-null  float64
 9   Total_Qty_Sold    64376 non-null  int64  
 10  Total_Kg_Sold     64376 non-null  float64
 11  Stock_On_Hand     64376 non-null  int64  
 12  avg_price_per_kg  64376 non-null  float64
dtypes: float64(6), int64(2), object(5)
memory usage: 6.4+ MB


The `info()` summary above shows 64376 entries across 13 columns: six `float64` columns, two `int64` columns and five `object` columns. No column contains null values.
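The `Date` column is stored as `object` (plain strings), so the Month/Season parameter from section 1.2 cannot be read from it directly. Below is a minimal sketch of one way to derive it, using three dates copied from the `head()` output. The column names `Month` and `Season` and the Southern Hemisphere season mapping are our own illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Small stand-in for train_set, using values copied from the head() output.
sample = pd.DataFrame({
    "Commodities": ["APPLE GRANNY SMITH", "APPLE GOLDEN DELICIOUS", "AVOCADO PINKERTON"],
    "Date": ["2020-03-10", "2020-09-09", "2020-05-05"],
})

# Convert the string column to datetime so that date parts become accessible.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Hypothetical derived feature: calendar month (1-12).
sample["Month"] = sample["Date"].dt.month

# Hypothetical season mapping for the Southern Hemisphere (an assumption).
season_of_month = {
    12: "summer", 1: "summer", 2: "summer",
    3: "autumn", 4: "autumn", 5: "autumn",
    6: "winter", 7: "winter", 8: "winter",
    9: "spring", 10: "spring", 11: "spring",
}
sample["Season"] = sample["Month"].map(season_of_month)

print(sample.dtypes)
print(sample)
```

The same three statements could be applied to `train_set` and `test_set` once the approach has been checked against the full data.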

In [94]:
#Summary statistic of each column in the dataframe.
train_set.describe()

| | Weight_Kg | Low_Price | High_Price | Sales_Total | Total_Qty_Sold | Total_Kg_Sold | Stock_On_Hand | avg_price_per_kg |
|---|---|---|---|---|---|---|---|---|
| count | 64376.0 | 64376.0 | 64376.0 | 64376.0 | 64376.0 | 64376.0 | 64376.0 | 64376.0 |
| mean | 12.781592 | 75.651938 | 89.607858 | 19395.01 | 446.104402 | 3336.641295 | 477.646328 | |
| std | 35.943052 | 159.508144 | 172.223177 | 44421.92 | 1184.169758 | 7682.295441 | 1453.892091 | |
| min | 0.12 | 1.0 | 1.0 | -57700.0 | -595.0 | -5040.0 | -512.0 | -inf |
| 25% | 7.0 | 30.0 | 35.0 | 1154.0 | 20.0 | 175.0 | 0.0 | 4.02 |
| 50% | 10.0 | 46.0 | 55.0 | 5400.0 | 107.0 | 940.0 | 76.0 | 6.0 |
| 75% | 11.0 | 80.0 | 100.0 | 18772.0 | 390.0 | 3250.0 | 381.0 | 8.67 |
| max | 500.0 | 4400.0 | 4400.0 | 1134701.0 | 39453.0 | 192230.0 | 93193.0 | inf |
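The summary statistics report `-inf` and `inf` for `avg_price_per_kg` and negative minimums for the sales columns, and `head()` shows commodities other than apples. The sketch below shows how such rows might be screened out before modelling. The first three rows are copied from `head()`; the last two are invented to mimic the problem values. Whether to drop or repair such rows, and whether to keep every apple variety, are judgment calls we have not yet made.

```python
import numpy as np
import pandas as pd

# Rows 0-2 are copied from head(); rows 3-4 are invented stand-ins that mimic
# the infinite price and negative sales figures visible in describe().
raw = pd.DataFrame({
    "Commodities": ["APPLE GOLDEN DELICIOUS", "APPLE GRANNY SMITH",
                    "AVOCADO PINKERTON", "APPLE GOLDEN DELICIOUS",
                    "APPLE GOLDEN DELICIOUS"],
    "Sales_Total": [51710.0, 3236.0, 4860.0, 0.0, -57700.0],
    "Total_Kg_Sold": [6075.6, 348.0, 1320.0, 0.0, -5040.0],
    "avg_price_per_kg": [8.51, 9.30, 3.68, np.inf, 11.45],
})

# Keep only apple rows (assumption: every apple variety is relevant).
apples = raw[raw["Commodities"].str.startswith("APPLE")].copy()

# Treat infinite prices as missing values, then drop them.
apples["avg_price_per_kg"] = apples["avg_price_per_kg"].replace([np.inf, -np.inf], np.nan)
apples = apples.dropna(subset=["avg_price_per_kg"])

# Drop rows whose sold weight is zero or negative.
apples = apples[apples["Total_Kg_Sold"] > 0]

print(apples)
```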


In [95]:
#Viewing the first five rows of our test_set dataframe.
test_set.head()

| | Index | Province | Container | Size_Grade | Weight_Kg | Commodities | Date | Low_Price | High_Price | Sales_Total | Total_Qty_Sold | Total_Kg_Sold | Stock_On_Hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | W.CAPE-BERGRIVER ETC | EC120 | 1M | 12.0 | APPLE GOLDEN DELICIOUS | 2020-07-09 | 128.0 | 136.0 | 5008.0 | 38 | 456.0 | 0 |
| 1 | 2 | W.CAPE-BERGRIVER ETC | M4183 | 1X | 18.3 | APPLE GOLDEN DELICIOUS | 2020-01-20 | 220.0 | 220.0 | 1760.0 | 8 | 146.4 | 2 |
| 2 | 3 | W.CAPE-BERGRIVER ETC | EC120 | 1S | 12.0 | APPLE GOLDEN DELICIOUS | 2020-08-19 | 120.0 | 120.0 | 720.0 | 6 | 72.0 | 45 |
| 3 | 4 | W.CAPE-BERGRIVER ETC | M4183 | 1M | 18.3 | APPLE GOLDEN DELICIOUS | 2020-05-06 | 160.0 | 160.0 | 160.0 | 1 | 18.3 | 8 |
| 4 | 5 | W.CAPE-BERGRIVER ETC | M4183 | 1L | 18.3 | APPLE GOLDEN DELICIOUS | 2020-05-04 | 140.0 | 160.0 | 14140.0 | 100 | 1830.0 | 19 |


In [96]:
#The dataframe has 685 rows and 13 columns.
test_set.shape

(685, 13)

In [97]:
#The info method summarises the data, i.e. the data type and non-null count of each column.
test_set.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 685 entries, 0 to 684
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Index           685 non-null    int64  
 1   Province        685 non-null    object 
 2   Container       685 non-null    object 
 3   Size_Grade      685 non-null    object 
 4   Weight_Kg       685 non-null    float64
 5   Commodities     685 non-null    object 
 6   Date            685 non-null    object 
 7   Low_Price       685 non-null    float64
 8   High_Price      685 non-null    float64
 9   Sales_Total     685 non-null    float64
 10  Total_Qty_Sold  685 non-null    int64  
 11  Total_Kg_Sold   685 non-null    float64
 12  Stock_On_Hand   685 non-null    int64  
dtypes: float64(5), int64(3), object(5)
memory usage: 69.7+ KB


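The `info()` summary above shows 685 entries across 13 columns: five `float64` columns, three `int64` columns and five `object` columns. No column contains null values. Unlike the training set, the test set has an `Index` column and no `avg_price_per_kg` column, which is the value we need to predict.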

In [98]:
#Summary statistic of each column in the dataframe.
test_set.describe()

| | Index | Weight_Kg | Low_Price | High_Price | Sales_Total | Total_Qty_Sold | Total_Kg_Sold | Stock_On_Hand |
|---|---|---|---|---|---|---|---|---|
| count | 685.0 | 685.0 | 685.0 | 685.0 | 685.0 | 685.0 | 685.0 | 685.0 |
| mean | 343.0 | 34.142482 | 164.202891 | 195.590073 | 18788.111212 | 174.883212 | 2725.402336 | 439.245255 |
| std | 197.886752 | 87.575995 | 355.167319 | 389.109476 | 33951.586813 | 299.351142 | 5059.123311 | 715.985761 |
| min | 1.0 | 3.0 | 10.0 | 10.0 | 10.0 | 1.0 | 6.3 | 0.0 |
| 25% | 172.0 | 9.0 | 50.0 | 64.0 | 1300.0 | 13.0 | 204.0 | 20.0 |
| 50% | 343.0 | 12.0 | 80.0 | 112.0 | 5520.0 | 62.0 | 860.1 | 153.0 |
| 75% | 514.0 | 18.3 | 128.0 | 160.0 | 21176.0 | 200.0 | 3033.0 | 516.0 |
| max | 685.0 | 400.0 | 2400.0 | 2400.0 | 308010.0 | 2774.0 | 47200.0 | 6827.0 |


After viewing our data, we also viewed the sample submission to confirm that `avg_price_per_kg` is the response variable expected in the Kaggle submission file.

In [99]:
sample_submission.head()

| | Index | avg_price_per_kg |
|---|---|---|
| 0 | 1 | 13.94 |
| 1 | 2 | 1.3 |


#### 2.4. Make an initial plot of the data

## 3. Exploratory Data Analysis <a name="explore"></a>
### 3.1. Explore the data shape and types
### Look for null values
### Give data descriptions
### Is the data univariate or multivariate?
### Determine kurtosis and skew
### Consider the distribution of the data
### Look for correlation of multivariate data 
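As a starting point for the skew, kurtosis and correlation checks listed above, here is a small sketch using the five rows shown by `train_set.head()`. Five rows are far too few to draw conclusions from; the point is only to show the pandas calls, which we assume will be run on the full (cleaned) training set later.

```python
import pandas as pd

# Numeric columns copied from the five rows of the train_set.head() output.
head = pd.DataFrame({
    "Weight_Kg": [12.0, 18.3, 20.0, 9.0, 10.0],
    "Low_Price": [108.0, 150.0, 70.0, 60.0, 40.0],
    "High_Price": [112.0, 170.0, 80.0, 60.0, 45.0],
    "avg_price_per_kg": [9.3, 8.51, 3.68, 6.67, 4.48],
})

# Sample skewness per column: 0 means symmetric, positive means a long right tail.
skewness = head.skew()

# Excess kurtosis per column (Fisher definition: a normal distribution scores 0).
kurtosis = head.kurt()

# Pearson correlation matrix of the numeric columns.
corr = head.select_dtypes(include="number").corr()

print(skewness)
print(kurtosis)
print(corr.round(2))
```

Highly correlated predictors such as `Low_Price` and `High_Price` are worth noting here, because they bear on the multicollinearity check later in the notebook.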
## 4. Split the data between training data and testing data <a name="split"></a>
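The Kaggle `test_set` has no `avg_price_per_kg` column, so it cannot be used to measure prediction error. One option is to hold out part of `train_set` for that purpose. The sketch below does this with plain pandas on an invented ten-row frame; the 80/20 ratio and the seed are arbitrary assumptions. `sklearn.model_selection.train_test_split` is a common alternative.

```python
import pandas as pd

# Invented stand-in frame: one feature x and one target y.
data = pd.DataFrame({"x": range(10), "y": [2 * v + 1 for v in range(10)]})

# Randomly pick 80% of the rows for training; random_state makes it repeatable.
train = data.sample(frac=0.8, random_state=42)

# The rows that were not sampled form the hold-out (testing) set.
test = data.drop(train.index)

print(len(train), "training rows,", len(test), "testing rows")
```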
## Check for linearity, multicollinearity, independence, homoscedasticity, normality
### Homoscedasticity
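Does the magnitude of the residuals increase as the fitted values increase? If so, the residual plot takes on a cone shape, which is called heteroscedasticity. We don’t want that, because linear regression assumes the residuals have constant variance (homoscedasticity).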
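To make the cone shape concrete, the sketch below simulates data whose noise grows with `x` (it is not the apples data), fits a straight line, and compares the spread of the residuals in the lower and upper halves of the fitted values. This split-half comparison is a rough check we chose for illustration, not a formal test such as Breusch-Pagan.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the noise standard deviation grows with x,
# so the residuals should fan out as the fitted values increase.
x = np.linspace(1.0, 10.0, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

# Fit a straight line by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Compare the residual spread in the lower and upper halves of the fitted values.
order = np.argsort(fitted)
half = len(order) // 2
spread_low = residuals[order[:half]].std()
spread_high = residuals[order[half:]].std()
ratio = spread_high / spread_low

print(f"residual spread ratio (high/low fitted values): {ratio:.2f}")
```

A ratio well above 1 suggests heteroscedasticity; a ratio near 1 is what we would hope to see for the real model.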
### Normality
Draw a histogram of the normalized residuals and look for a bell curve centred on zero. Then draw a QQ plot of the residuals; if the residuals are normally distributed, the points fall close to a straight line.
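The points of a QQ plot can be computed without any plotting library, which makes the idea easier to see. The sketch below uses simulated residuals as a stand-in for real model residuals, and the plotting positions `(i - 0.5) / n` are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)

# Simulated residuals standing in for the residuals of a fitted model.
residuals = rng.normal(loc=0.0, scale=2.0, size=500)

# Standardise the residuals, then sort them to obtain the sample quantiles.
standardised = (residuals - residuals.mean()) / residuals.std(ddof=1)
sample_q = np.sort(standardised)

# Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n.
n = len(sample_q)
probs = (np.arange(1, n + 1) - 0.5) / n
theory_q = np.array([NormalDist().inv_cdf(p) for p in probs])

# For normal residuals the QQ points lie close to the line y = x,
# so the correlation between the two sets of quantiles is close to 1.
qq_corr = np.corrcoef(theory_q, sample_q)[0, 1]

print(f"QQ correlation: {qq_corr:.4f}")
```

Plotting `theory_q` against `sample_q` with `plt.scatter` gives the QQ plot itself.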
## 5. Check for outliers in residuals <a name="outliers"></a>
### Plot Cook’s distance
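Cook's distance measures how much the fitted values change when a single observation is removed. For observation $i$ with residual $e_i$, leverage $h_{ii}$, $p$ fitted parameters and residual variance estimate $s^2$, it is $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}$. The sketch below computes it with NumPy on simulated data containing one planted outlier. The `4 / n` cut-off is a common rule of thumb, not a fixed standard.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with one deliberately planted outlier at the last index.
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[-1] += 8.0

# Design matrix with an intercept column; p is the number of fitted parameters.
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Ordinary least squares fit, residuals and leverage (diagonal of the hat matrix).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
hat = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(hat)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
s2 = residuals @ residuals / (n - p)
cooks_d = residuals**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

# Rule of thumb: flag observations with D_i greater than 4 / n.
flagged = np.where(cooks_d > 4.0 / n)[0]

print("flagged observations:", flagged)
```

A stem plot of `cooks_d` against the observation index is the usual way to present the result.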
## 6. Build the Regression Model <a name="regress"></a>
Consider a tree model
### Results if we follow method 1: Taking Estimates
#### Show our calculations
#### Plot our results
#### Assess our results
### Results if we follow method 2: Least Squares Method
#### Show our calculations
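For a single predictor, the least squares line has the closed form $b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $a = \bar{y} - b\,\bar{x}$. The sketch below applies these formulas to invented toy data that lies exactly on $y = 2x + 1$, so the result is easy to check by eye. It is a placeholder for our actual calculation on the apples data.

```python
import numpy as np

# Invented toy data lying exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Closed-form least squares estimates for slope and intercept.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Predictions and root mean squared error on the same data.
predictions = intercept + slope * x
rmse = np.sqrt(np.mean((y - predictions) ** 2))

print(f"slope={slope:.3f}, intercept={intercept:.3f}, rmse={rmse:.3f}")
```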
#### Plot our results
#### Assess our results
### Results if we build our model using sklearn:
#### Show our calculations
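A minimal sketch of the same fit using scikit-learn's `LinearRegression`, again on invented toy data that lies exactly on $y = 2x + 1$. The real model would use the cleaned apple features instead, and its predictors have not been chosen yet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented toy data: scikit-learn expects a 2-D feature array.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Fit an ordinary least squares model.
model = LinearRegression()
model.fit(X, y)

# Predict on the training data and compute the root mean squared error.
predictions = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, predictions))

print(f"slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}, rmse={rmse:.3f}")
```

On real data the error should be measured on a hold-out set rather than on the rows used for fitting.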
#### Plot our results
#### Assess our results


## 7. Conclusion <a name="conclude"></a>

### What we accomplished. 
### What we learnt.
