## Linear and Polynomial Regression for Pumpkin Pricing

## Preparation

* When is the best time to buy pumpkins?
* What price can I expect for a case of miniature pumpkins?
* Should I buy them in half-bushel baskets or by the 1 1/9 bushel box?

Let's keep digging into this data.


Load up required libraries and dataset. Convert the data to a dataframe containing a subset of the data:

- Only get pumpkins priced by the bushel
- Convert the date to a month
- Calculate the price to be an average of high and low prices
- Convert the price to reflect the pricing by bushel quantity


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pumpkins = pd.read_csv('US-pumpkins.csv')

pumpkins.head()

Unnamed: 0,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,Mostly Low,...,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode,Unnamed: 24,Unnamed: 25
0,BALTIMORE,,24 inch bins,,,,4/29/17,270.0,280.0,270.0,...,,,,,,,E,,,
1,BALTIMORE,,24 inch bins,,,,5/6/17,270.0,280.0,270.0,...,,,,,,,E,,,
2,BALTIMORE,,24 inch bins,HOWDEN TYPE,,,9/24/16,160.0,160.0,160.0,...,,,,,,,N,,,
3,BALTIMORE,,24 inch bins,HOWDEN TYPE,,,9/24/16,160.0,160.0,160.0,...,,,,,,,N,,,
4,BALTIMORE,,24 inch bins,HOWDEN TYPE,,,11/5/16,90.0,100.0,90.0,...,,,,,,,N,,,


> Create a regression model to see if you can predict which package of pumpkins will have the best pumpkin prices.

In [4]:
from sklearn.preprocessing import LabelEncoder

pumpkins = pumpkins[pumpkins['Package'].str.contains(
    'bushel', case=True, regex=True)]

new_columns = ['Package', 'Variety', 'City Name',
               'Month', 'Low Price', 'High Price', 'Date']

pumpkins = pumpkins.drop(
    [c for c in pumpkins.columns if c not in new_columns], axis=1)

price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2

month = pd.DatetimeIndex(pumpkins['Date']).month

new_pumpkins = pd.DataFrame({'Month': month, 'Variety': pumpkins['Variety'], 'City': pumpkins['City Name'],
                            'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'], 'High Price': pumpkins['High Price'], 'Price': price})

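# Normalize to a price per bushel. A '1 1/9 bushel' package holds 1 + 1/9
# bushels; dividing by 1.1 approximates that, and the output below reflects 1.1.
new_pumpkins.loc[new_pumpkins['Package'].str.contains(
    '1 1/9'), 'Price'] = price/1.1

# A '1/2 bushel' package holds half a bushel, so double its price.
new_pumpkins.loc[new_pumpkins['Package'].str.contains(
    '1/2'), 'Price'] = price*2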

new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:,
                                               0:-1].apply(LabelEncoder().fit_transform)

new_pumpkins


Unnamed: 0,Month,Variety,City,Package,Low Price,High Price,Price
70,1,3,1,0,5,3,13.636364
71,1,3,1,0,10,7,16.363636
72,2,3,1,0,10,7,16.363636
73,2,3,1,0,9,6,15.454545
74,2,3,1,0,5,3,13.636364
...,...,...,...,...,...,...,...
1738,1,1,9,2,5,3,30.000000
1739,1,1,9,2,3,3,28.750000
1740,1,1,9,2,0,3,25.750000
1741,1,1,9,2,1,0,24.000000
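
The per-bushel conversion can be checked by hand. The sketch below uses a made-up three-row frame (the package labels and prices are illustrative assumptions, not rows from the dataset) and divides by the exact fraction 1 + 1/9, whereas the cell above divides by 1.1, a close approximation.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from US-pumpkins.csv.
toy = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel cartons'],
    'Low Price': [15.0, 10.0, 20.0],
    'High Price': [15.0, 14.0, 24.0],
})

# Average of the low and high price for each row.
toy['Price'] = (toy['Low Price'] + toy['High Price']) / 2

# Normalize every row to a price per single bushel.
is_1_1_9 = toy['Package'].str.contains('1 1/9', regex=False)
is_half = toy['Package'].str.contains('1/2', regex=False)
toy.loc[is_1_1_9, 'Price'] = toy.loc[is_1_1_9, 'Price'] / (1 + 1/9)
toy.loc[is_half, 'Price'] = toy.loc[is_half, 'Price'] * 2

print(toy[['Package', 'Price']])
```

With these made-up numbers, a 15.00 carton of 1 1/9 bushels works out to 13.50 per bushel, and a 12.00 half-bushel carton to 24.00 per bushel.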


### Finding a good correlation between two points of your data to potentially build a good predictive model

In [6]:
print(new_pumpkins['Variety'].corr(new_pumpkins['Price']))
print(new_pumpkins['Package'].corr(new_pumpkins['Price']))

-0.8634790400214399
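
As a minimal sketch of what `corr` computes, the example below builds a small made-up frame (the values are assumptions for illustration) and shows that `Series.corr` returns the Pearson coefficient, the same number NumPy reports in its correlation matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded values for illustration only.
toy = pd.DataFrame({
    'Package': [0, 0, 1, 1, 2, 2],
    'Price': [14.0, 16.0, 18.0, 20.0, 26.0, 30.0],
})

# Series.corr defaults to the Pearson correlation coefficient.
r_pandas = toy['Package'].corr(toy['Price'])

# The same value read from NumPy's 2x2 correlation matrix (off-diagonal entry).
r_numpy = np.corrcoef(toy['Package'], toy['Price'])[0, 1]

print(r_pandas, r_numpy)
```

Keep in mind that label-encoded categories get arbitrary integer codes, so the sign and size of a correlation with an encoded column depend on how the labels happened to be ordered.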


## Building a linear model

> Before building your model, do one more tidy-up of your data. Drop any null data and check once more what the data looks like.

In [7]:
new_pumpkins.dropna(inplace=True)
new_pumpkins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415 entries, 70 to 1742
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Month       415 non-null    int64  
 1   Variety     415 non-null    int32  
 2   City        415 non-null    int32  
 3   Package     415 non-null    int32  
 4   Low Price   415 non-null    int64  
 5   High Price  415 non-null    int64  
 6   Price       415 non-null    float64
dtypes: float64(1), int32(3), int64(3)
memory usage: 21.1 KB


In [8]:
new_columns = ['Package', 'Price']
lin_pumpkins = new_pumpkins.drop(
    [c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
lin_pumpkins

Unnamed: 0,Package,Price
70,0,13.636364
71,0,16.363636
72,0,16.363636
73,0,15.454545
74,0,13.636364
...,...,...
1738,2,30.000000
1739,2,28.750000
1740,2,25.750000
1741,2,24.000000


> Now you can assign your X and y coordinate data:

In [None]:
X = lin_pumpkins.values[:, :1]
y = lin_pumpkins.values[:, 1:2]
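
The slices above use ranges (`:1`, `1:2`) rather than single indexes so that both arrays stay two-dimensional, which is the shape scikit-learn expects for `X`. A small sketch on a made-up stand-in for `lin_pumpkins` (the values are assumptions) shows the difference:

```python
import pandas as pd

# Hypothetical stand-in for lin_pumpkins.
toy = pd.DataFrame({'Package': [0, 0, 2, 2], 'Price': [13.6, 16.4, 30.0, 24.0]})

# Slicing with ranges keeps two dimensions: shape (n_samples, 1).
X = toy.values[:, :1]
y = toy.values[:, 1:2]
print(X.shape, y.shape)

# Indexing with a single integer drops the column axis instead.
flat = toy.values[:, 0]
print(flat.shape)
```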


> Next, start the regression model-building routines:

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

pred = lin_reg.predict(X_test)

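# score() returns the coefficient of determination (R^2), not a classification
# accuracy. It is computed here on the training data.
r2_train = lin_reg.score(X_train, y_train)
print('Model R^2 (train): ', r2_train)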
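
The cell above imports `r2_score`, `mean_squared_error` and `mean_absolute_error` without using them. One possible way to put them to work is to evaluate on the held-out test split, sketched here on synthetic data (the data and the name `model` are assumptions for illustration, not the notebook's pumpkin data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1)).astype(float)
y = 14.0 + 6.0 * X[:, 0] + rng.normal(0.0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# For a regressor, model.score(X_test, y_test) and r2_score agree.
r2_test = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(r2_test, mse, mae)
```

A score computed on the training data tends to look better than one computed on unseen data, so the test-set numbers are usually the more honest estimate.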


You can visualize the line that's drawn in the process:

In [None]:
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, pred, color='blue', linewidth=3)

plt.xlabel('Package')
plt.ylabel('Price')

plt.show()

Test the model against a hypothetical package value (the model was trained on the encoded Package column):

In [None]:
lin_reg.predict(np.array([[2.75]]))
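
A prediction from a fitted linear model is just the intercept plus the slope times the input. The sketch below illustrates this on synthetic, exactly linear data (an assumption for illustration; `model` is not the notebook's `lin_reg`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, exactly linear data for illustration: price = 14 + 6 * package.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 14.0 + 6.0 * X[:, 0]

model = LinearRegression().fit(X, y)

# predict() expects a 2-D array: one row per sample, one column per feature.
x_new = np.array([[2.75]])
predicted = model.predict(x_new)[0]

# The same value computed by hand from the fitted line.
by_hand = model.intercept_ + model.coef_[0] * 2.75
print(predicted, by_hand)
```

Because the Package column is label-encoded into integers, a fractional input such as 2.75 falls between two encoded categories and is best read as a point on the fitted line rather than a real package type.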

## Polynomial regression

✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients.

While sometimes there's a linear relationship between variables (the bigger the pumpkin in volume, the higher the price), sometimes these relationships can't be plotted as a plane or straight line.
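
To make the idea concrete, here is a minimal sketch of how scikit-learn turns one input column into polynomial terms. The input values are made up, and degree 2 is chosen only to keep the output small.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])

# Degree-2 expansion of a single feature x gives the columns [1, x, x^2].
expanded = PolynomialFeatures(degree=2).fit_transform(X)
print(expanded)
```

A linear model fitted on these expanded columns is still linear in its coefficients, but the curve it draws against the original x is a polynomial.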

In [None]:
new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
poly_pumpkins = new_pumpkins.drop(
    [c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

poly_pumpkins


A good way to visualize the correlations between the columns of a dataframe is to display the correlation matrix with a 'coolwarm' color gradient.

Use the background_gradient() method with coolwarm as its cmap argument value:

In [None]:
corr = poly_pumpkins.corr()
# The styled table must be the last expression in the cell to be displayed.
corr.style.background_gradient(cmap='coolwarm')
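
Besides reading the colors, the correlation matrix can be queried directly. The sketch below, on a made-up frame (all values are assumptions for illustration), ranks the features by the absolute value of their correlation with Price:

```python
import pandas as pd

# Hypothetical encoded values for illustration only.
toy = pd.DataFrame({
    'Month': [9, 10, 9, 10, 9, 10],
    'Package': [0, 0, 1, 1, 2, 2],
    'Price': [14.0, 16.0, 18.0, 20.0, 26.0, 30.0],
})

corr = toy.corr()

# Rank the features by the strength of their correlation with Price.
ranking = corr['Price'].drop('Price').abs().sort_values(ascending=False)
print(ranking)
```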

## Create a pipeline

Scikit-learn includes a helpful API for building polynomial regression models: make_pipeline. It creates a 'pipeline', which is a chain of estimators. In this case, the pipeline chains a polynomial feature expansion, which lets the model follow a nonlinear path, with a linear regression.
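
As a sketch of what the pipeline does internally, the example below fits the same model twice on synthetic curved data: once through make_pipeline and once by running the two steps by hand. The data and the names `pipe`, `poly` and `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data for illustration: y = 1 + 2x - 0.5x^2.
X = np.linspace(0.0, 4.0, 30).reshape(-1, 1)
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2

# Pipeline: expand the features, then fit a linear model on the expansion.
pipe = make_pipeline(PolynomialFeatures(2), LinearRegression())
pipe.fit(X, y)

# The same two steps done by hand.
poly = PolynomialFeatures(2)
manual = LinearRegression().fit(poly.fit_transform(X), y)

x_new = np.array([[2.75]])
pipe_pred = pipe.predict(x_new)[0]
manual_pred = manual.predict(poly.transform(x_new))[0]
print(pipe_pred, manual_pred)
```

The pipeline mainly saves bookkeeping: the same transformation is applied automatically at both fit and predict time, so new inputs cannot be passed to the regressor unexpanded by mistake.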

1. Build out the X and y columns

In [None]:
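# poly_pumpkins keeps the column order of new_pumpkins:
# Month, Variety, City, Package, Price. Column 3 is Package, column 4 is Price.
X = poly_pumpkins.iloc[:, 3:4].values
y = poly_pumpkins.iloc[:, 4:5].values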


Create the pipeline by calling the make_pipeline() method:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Degree-4 polynomial features followed by an ordinary least squares fit.
pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)


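Let's check how well the model fits, using the R² score on the training data:

In [None]:
# score() returns the coefficient of determination (R^2) for a regressor.
r2_train = pipeline.score(X_train, y_train)
print('Model R^2 (train): ', r2_train)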
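
A higher polynomial degree can only raise the training score, so it helps to compare it with the score on the test split. The sketch below does this for a few degrees on synthetic data (the data, the degrees tried and the `scores` dictionary are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration: a noisy quadratic relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 4.0, size=(200, 1))
y = 10.0 + 3.0 * X[:, 0] ** 2 + rng.normal(0.0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Map each degree to its (train R^2, test R^2) pair.
scores = {}
for degree in (1, 2, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print(degree, scores[degree])
```

If the training score keeps climbing while the test score stalls or drops, the extra degrees are likely fitting noise rather than structure.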


And voila!

### Do a prediction

Can we input a new value and get a prediction?

Call predict() to make a prediction:

In [None]:

pipeline.predict(np.array([[2.75]]))
