<a href="https://colab.research.google.com/github/hazrakeruboO/DS-Colabs/blob/main/Copy_of_Python_Programming_Ridge_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Programming: Ridge Regression

## 1.0 Example 

In [None]:
# Example 
# ---
# Regularization is the process of penalizing the coefficients of variables, either by removing them or by reducing their impact. 
# Ridge regression shrinks the coefficients of problematic variables toward zero but never fully removes them. 
# ---
# Question: Build a regression model to predict expenses based on the variables available.
# ---
# Dataset source: Pydataset Library: VietNamI Dataset
# ---
#
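
The shrinkage described above can be seen in the closed-form ridge solution, w = (XᵀX + αI)⁻¹Xᵀy. The block below is a minimal sketch on synthetic data: the random data, the helper name `ridge_closed_form`, and the chosen alpha values are illustrative assumptions, not part of the notebook, and the intercept is left out for brevity.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
# Make the third column nearly a copy of the first (a "problematic" variable)
X_demo[:, 2] = X_demo[:, 0] + 0.01 * rng.normal(size=50)
y_demo = X_demo @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=50)

norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(w))
    # Coefficients get smaller as alpha grows, but none becomes exactly zero
    print(alpha, np.round(w, 3))
```

With alpha set to 0 this reduces to ordinary least squares; larger alpha values pull the whole coefficient vector toward zero.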

In [None]:
# Importing our libraries
# 
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# Installing pydataset with pip and importing it so that we can use a dataset from the package
# 
!pip install pydataset
from pydataset import data 

Collecting pydataset
  Downloading pydataset-0.2.0.tar.gz (15.9 MB)
Building wheels for collected packages: pydataset
  Building wheel for pydataset (setup.py) ... done
  Created wheel for pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939430 sha256=52d0115c3c8473ea10780a6c06362b1c565ea64982f6ed443a10f5deb0971e2f
  Stored in directory: /root/.cache/pip/wheels/32/26/30/d71562a19eed948eaada9a61b4d722fa358657a3bfb5d151e2
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
initiated datasets repo at: /root/.pydataset/


In [None]:
# Data Preparation
# 

# Loading the data and converting the sex variable to a dummy variable
#
df = pd.DataFrame(data('VietNamI'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)

# Setting up our X and y datasets
#
X = df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]
y = df['lnhhexp']

In [None]:
# Creating our baseline regression model
# This is a model that has no regularization to it
# 
regression = LinearRegression()
regression.fit(X,y)
first_model = (mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)

# The output value of 0.355289 (a training-set mean squared error) is our benchmark for judging whether the regularized ridge regression model is superior.

0.35528915032173053


In [None]:
# To create our ridge model we first need to determine the most appropriate strength for the L2 penalty. 
# L2 is the type of penalty used in ridge regression; alpha is the hyperparameter that controls its strength. 
# We choose the value of this hyperparameter by searching over a grid of candidates. 
# In the code below, we first create our ridge model and turn on normalization in order to get better estimates. 
# Next we set up the grid search. 
# The search object takes several arguments. Alpha is the hyperparameter we are trying to set. 
# np.logspace(-5,2,8) gives the range of values we want to test: 
# 8 values evenly spaced on a log scale from 10^-5 to 10^2. 
# Our metric is the (negated) mean squared error. refit=True refits the best model on the full dataset once the search is done, 
# and cv is the number of folds used for the cross-validation. 
# Note: the normalize argument was removed in scikit-learn 1.2, so this cell requires an older version.
#
ridge = Ridge(normalize=True)
search = GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
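
To make the grid concrete, the short sketch below prints the candidate values that `np.logspace(-5,2,8)` generates. The variable name `alphas` is only illustrative.

```python
import numpy as np

# Eight candidate alphas, evenly spaced on a log scale from 10**-5 to 10**2
alphas = np.logspace(-5, 2, 8)
print(alphas)
```

Each candidate is ten times the previous one, so the search covers seven orders of magnitude with only eight fits per fold.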

In [None]:
# We now use the .fit method to run the search and then use the .best_params_ and
# .best_score_ attributes to judge the model's strength. 
# 
search.fit(X,y)
search.best_params_        # returns {'alpha': 0.01}
abs(search.best_score_) 

# best_params_ tells us what to set alpha to, which in this case is 0.01. 
# best_score_ holds the best cross-validated score, here the negated mean squared error, hence the abs(). 
# In this case, the value of 0.38 is worse than the baseline model's 0.355. 
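
The `abs()` call is needed because scikit-learn maximizes scores, so `neg_mean_squared_error` reports the MSE with its sign flipped. The sketch below shows this on synthetic data (an assumption, not the VietNamI dataset) and also computes the training MSE, which is typically a little lower than the cross-validated one.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the VietNamI dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

model = Ridge(alpha=0.01)

# Each fold's score is a negated MSE, so every entry is <= 0
neg_mse_scores = cross_val_score(model, X_demo, y_demo,
                                 scoring='neg_mean_squared_error', cv=10)
cv_mse = -neg_mse_scores.mean()

# MSE measured on the same rows the model was fitted on
train_mse = mean_squared_error(y_demo, model.fit(X_demo, y_demo).predict(X_demo))

print(cv_mse, train_mse)
```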

In [None]:
# We can confirm this by fitting our model with the ridge information and finding the mean squared error below
#
ridge = Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model = (mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)
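
The `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2. One possible replacement, sketched here on synthetic data, is to scale the features with `StandardScaler` inside a pipeline. This is not numerically equivalent to `normalize=True`, so the best alpha it finds is not directly comparable with the 0.01 above. The names `X_demo`, `y_demo`, and `search_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the VietNamI dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Scaling is refitted inside each CV fold, so held-out rows do not leak into it
pipeline = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class name, here 'ridge'
search_demo = GridSearchCV(
    estimator=pipeline,
    param_grid={'ridge__alpha': np.logspace(-5, 2, 8)},
    scoring='neg_mean_squared_error',
    cv=10,
)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_, abs(search_demo.best_score_))
```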

In [None]:
# The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. 
# In addition, these results indicate that there is little difference between the ridge and baseline models. 
# This is confirmed with the coefficients of each model found below.
# 
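# Pair each coefficient with its feature name from X
# (the full dataset's columns also include the target, which would misalign the names)
coef_dict_baseline = dict(zip(X.columns, regression.coef_))
coef_dict_ridge = dict(zip(X.columns, ridge.coef_))
coef_dict_baseline, coef_dict_ridge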

# The coefficient values are about the same. This means that the penalization made little difference with this dataset.

## 2.0 Challenges

### <font color="green">Challenge 1</font>

In [None]:
# Challenge 1 
# ---
# Question: Build an accurate model that can estimate the weight of fish given the following dataset.
# ---
# Dataset url = http://bit.ly/FishDataset
# ---
# 
data=pd.read_csv('http://bit.ly/FishDataset')
data.head()


Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.52,4.02
1,Bream,290.0,24.0,26.3,31.2,12.48,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.73,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.444,5.134


In [None]:
data.tail()

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936
155,Smelt,13.4,11.7,12.4,13.5,2.43,1.269
156,Smelt,12.2,12.1,13.0,13.8,2.277,1.2558
157,Smelt,19.7,13.2,14.3,15.2,2.8728,2.0672
158,Smelt,19.9,13.8,15.0,16.2,2.9322,1.8792


In [None]:
data['Species'].unique()

array(['Bream', 'Roach', 'Whitefish', 'Parkki', 'Perch', 'Pike', 'Smelt'],
      dtype=object)

In [None]:
data.shape

(159, 7)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  159 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB


In [None]:
data.describe()

Unnamed: 0,Weight,Length1,Length2,Length3,Height,Width
count,159.0,159.0,159.0,159.0,159.0,159.0
mean,398.326415,26.24717,28.415723,31.227044,8.970994,4.417486
std,357.978317,9.996441,10.716328,11.610246,4.286208,1.685804
min,0.0,7.5,8.4,8.8,1.7284,1.0476
25%,120.0,19.05,21.0,23.15,5.9448,3.38565
50%,273.0,25.2,27.3,29.4,7.786,4.2485
75%,650.0,32.7,35.5,39.65,12.3659,5.5845
max,1650.0,59.0,63.4,68.0,18.957,8.142


In [None]:
data.duplicated().sum()

0

In [None]:
data.isnull().sum()

Species    0
Weight     0
Length1    0
Length2    0
Length3    0
Height     0
Width      0
dtype: int64

In [None]:
# Work on a copy of the frame, then split off the target column
X = data.copy()
Y = X.pop('Weight')

In [None]:
# Assumption: the incomplete drop call was meant to remove the non-numeric
# Species column, which LinearRegression cannot use without encoding
X = X.drop(columns=['Species'])

In [None]:
Y.head()

0    242.0
1    290.0
2    340.0
3    363.0
4    430.0
Name: Weight, dtype: float64

In [None]:
X.head()

Unnamed: 0,Length1,Length2,Length3,Height,Width
0,23.2,25.4,30.0,11.52,4.02
1,24.0,26.3,31.2,12.48,4.3056
2,23.9,26.5,31.1,12.3778,4.6961
3,26.3,29.0,33.5,12.73,4.4555
4,26.5,29.0,34.0,12.444,5.134

In [None]:
# Hold out 20% of the rows for testing
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

In [None]:
# Note: Species has not been one-hot encoded, so this model may give poor results
from sklearn.linear_model import LinearRegression
L=LinearRegression()

In [None]:
L.fit(X_train,Y_train)
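
One way to address the missing encoding is `pd.get_dummies`, sketched below on ten rows copied from the `head()` and `tail()` outputs above. The names `fish`, `features`, and `target` are illustrative, and `alpha=1.0` is a placeholder rather than a tuned value.

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Ten rows copied from the head() and tail() outputs above
fish = pd.DataFrame({
    'Species': ['Bream'] * 5 + ['Smelt'] * 5,
    'Weight': [242.0, 290.0, 340.0, 363.0, 430.0, 12.2, 13.4, 12.2, 19.7, 19.9],
    'Length1': [23.2, 24.0, 23.9, 26.3, 26.5, 11.5, 11.7, 12.1, 13.2, 13.8],
    'Length2': [25.4, 26.3, 26.5, 29.0, 29.0, 12.2, 12.4, 13.0, 14.3, 15.0],
    'Length3': [30.0, 31.2, 31.1, 33.5, 34.0, 13.4, 13.5, 13.8, 15.2, 16.2],
    'Height': [11.52, 12.48, 12.3778, 12.73, 12.444, 2.0904, 2.43, 2.277, 2.8728, 2.9322],
    'Width': [4.02, 4.3056, 4.6961, 4.4555, 5.134, 1.3936, 1.269, 1.2558, 2.0672, 1.8792],
})

# One indicator column per species; drop_first removes the redundant first level
features = pd.get_dummies(fish.drop(columns=['Weight']), columns=['Species'],
                          drop_first=True, dtype=float)
target = fish['Weight']

model = Ridge(alpha=1.0)  # placeholder alpha, not a tuned value
model.fit(features, target)
print(dict(zip(features.columns, model.coef_.round(3))))
```

On the full dataset the same call would produce one indicator column for each of the seven species except the first.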

### <font color="green">Challenge 2</font>

In [None]:
# Challenge 2
# ---
# Question: Build a regression algorithm for predicting unemployment within an economy.
# ---
# Dataset url = http://bit.ly/EconomicDataset
# ---
# Dataset Info
# 1. date. Month of data collection
# 2. psavert, personal savings rate
# 3. pce, personal consumption expenditures, in billions of dollars
# 4. unemploy, number of unemployed in thousands 
# 5. uempmed, median duration of unemployment, in weeks
# 6. pop, total population, in thousands
# ---
# 
employment=pd.read_csv('http://bit.ly/EconomicDataset')
employment.head()

Unnamed: 0.1,Unnamed: 0,date,pce,pop,psavert,uempmed,unemploy
0,1,1967-06-30,507.8,198712,9.8,4.5,2944
1,2,1967-07-31,510.9,198911,9.8,4.7,2945
2,3,1967-08-31,516.7,199113,9.0,4.6,2958
3,4,1967-09-30,513.3,199311,9.8,4.9,3143
4,5,1967-10-31,518.5,199498,9.7,4.7,3066


In [None]:
employment.tail()

Unnamed: 0.1,Unnamed: 0,date,pce,pop,psavert,uempmed,unemploy
473,474,2006-11-30,9478.5,301070,-1.1,7.3,6849
474,475,2006-12-31,9540.3,301296,-0.9,8.1,7017
475,476,2007-01-31,9610.6,301481,-1.0,8.1,6865
476,477,2007-02-28,9653.0,301684,-0.7,8.5,6724
477,478,2007-03-31,9705.0,301913,-1.3,8.7,6801


In [None]:
employment.shape

(478, 7)

In [None]:
employment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  478 non-null    int64  
 1   date        478 non-null    object 
 2   pce         478 non-null    float64
 3   pop         478 non-null    int64  
 4   psavert     478 non-null    float64
 5   uempmed     478 non-null    float64
 6   unemploy    478 non-null    int64  
dtypes: float64(3), int64(3), object(1)
memory usage: 26.3+ KB


In [None]:
employment.describe()

Unnamed: 0.1,Unnamed: 0,pce,pop,psavert,uempmed,unemploy
count,478.0,478.0,478.0,478.0,478.0,478.0
mean,239.5,3654.230962,246348.939331,6.72113,7.124059,6997.177824
std,138.130976,2609.656755,30126.735749,3.476889,1.640329,1859.035642
min,1.0,507.8,198712.0,-3.0,4.0,2685.0
25%,120.25,1272.45,220094.25,4.0,5.8,6052.5
50%,239.5,3082.45,242515.5,7.6,6.9,7187.5
75%,358.75,5474.15,272277.25,9.5,8.375,8250.25
max,478.0,9705.0,301913.0,14.6,12.3,12051.0


In [None]:
employment.duplicated().sum()

0

In [None]:
employment.isnull().sum()

Unnamed: 0    0
date          0
pce           0
pop           0
psavert       0
uempmed       0
unemploy      0
dtype: int64
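
A possible next step for this challenge is sketched below on ten rows copied from the `head()` and `tail()` outputs above. The choice of features, the scaling step, and `alpha=1.0` are assumptions for illustration only; a real solution would use the full dataset, tune alpha, and evaluate on held-out months.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ten rows copied from the head() and tail() outputs above
employment_demo = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3, 4, 5, 474, 475, 476, 477, 478],
    'date': ['1967-06-30', '1967-07-31', '1967-08-31', '1967-09-30', '1967-10-31',
             '2006-11-30', '2006-12-31', '2007-01-31', '2007-02-28', '2007-03-31'],
    'pce': [507.8, 510.9, 516.7, 513.3, 518.5, 9478.5, 9540.3, 9610.6, 9653.0, 9705.0],
    'pop': [198712, 198911, 199113, 199311, 199498, 301070, 301296, 301481, 301684, 301913],
    'psavert': [9.8, 9.8, 9.0, 9.8, 9.7, -1.1, -0.9, -1.0, -0.7, -1.3],
    'uempmed': [4.5, 4.7, 4.6, 4.9, 4.7, 7.3, 8.1, 8.1, 8.5, 8.7],
    'unemploy': [2944, 2945, 2958, 3143, 3066, 6849, 7017, 6865, 6724, 6801],
})

# 'Unnamed: 0' is a row counter and 'date' is a string, so neither is used as a feature here
features = employment_demo.drop(columns=['Unnamed: 0', 'date', 'unemploy'])
target = employment_demo['unemploy']

# The features sit on very different scales, so standardize before penalizing
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # placeholder alpha
model.fit(features, target)
predictions = model.predict(features)
print(predictions.round(0))
```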

### <font color="green">Challenge 3</font>

In [None]:
# Challenge 3
# ---
# Question: Build a regression model to predict the life expectancy of a country. 
# Apply ridge regression to your model.
# ---
# Dataset url = http://bit.ly/LifeExpectancyDataset
# ---
# Dataset Info:
# Country: Country
# Year: Year
# Status: Developed or Developing status
# Life expectancy: Life Expectancy in age
# Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
# infant deaths: Number of Infant Deaths per 1000 population
# Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
# percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)
# Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
# Measles: Number of reported cases per 1000 population
# BMI: Average Body Mass Index of entire population
# under-five deaths: Number of under-five deaths per 1000 population
# Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
# Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
# Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
# HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
# GDP: Gross Domestic Product per capita (in USD)
# Population: Population of the country
# thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
# thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9 (%)
# Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
# Schooling: Number of years of Schooling (years)
# ---
# 
# OUR CODE GOES HERE
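
As a starting point, the sketch below shows one way to combine imputation, encoding, scaling, and ridge regression in a single pipeline. The rows in `toy` are made-up values for illustration; only the column names come from the Dataset Info above. It assumes the real data may contain missing values and that `Status` is the only categorical feature used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with made-up values; only the column names come from the Dataset Info
toy = pd.DataFrame({
    'Status': ['Developing', 'Developed', 'Developing', 'Developed', 'Developing', 'Developing'],
    'Adult Mortality': [263.0, 70.0, np.nan, 65.0, 300.0, 150.0],
    'Schooling': [10.1, 16.0, 9.5, np.nan, 8.0, 12.0],
    'Life expectancy': [65.0, 81.0, 60.0, 82.0, 58.0, 72.0],
})

numeric_cols = ['Adult Mortality', 'Schooling']
categorical_cols = ['Status']

preprocess = ColumnTransformer([
    # Fill missing numbers with the column median, then standardize
    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), numeric_cols),
    # Turn each category into indicator columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

model = Pipeline([('preprocess', preprocess), ('ridge', Ridge(alpha=1.0))])  # placeholder alpha
model.fit(toy[numeric_cols + categorical_cols], toy['Life expectancy'])
predictions = model.predict(toy[numeric_cols + categorical_cols])
print(predictions.round(1))
```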

### <font color="green">Challenge 4</font>

In [None]:
# Challenge 4
# ---
# Question: Given the beauty dataset below, create a regression model to predict wages upon applying ridge regression.
# ---
# Dataset url = http://bit.ly/BeautyDataset
# ---
# 
# OUR CODE GOES HERE

### <font color="green">Challenge 5</font>

In [None]:
# Challenge 5
# ---
# Create a regression model to predict sales prices. 
# Apply regularization techniques.
# ---
# Dataset source = http://bit.ly/HousePricesDataset
# ---
# 
# OUR CODE GOES HERE
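
Since this challenge asks for regularization techniques in general, the sketch below contrasts ridge (L2) with lasso (L1) on synthetic data. The data is a stand-in, not the house prices dataset, and both alpha values are placeholders. Lasso can set coefficients exactly to zero, whereas ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, of which only 3 carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

ridge_demo = Ridge(alpha=1.0).fit(X_scaled, y_demo)  # placeholder alpha
lasso_demo = Lasso(alpha=1.0).fit(X_scaled, y_demo)  # placeholder alpha

# Count the coefficients that each penalty set exactly to zero
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
print('ridge zeros:', ridge_zeros)
print('lasso zeros:', lasso_zeros)
```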