# Simple Linear Regression (Gapminder Project)

In this project you are going to work with the __gapminder__ dataset, which tracks economic and social indicators such as population, life expectancy, and GDP per capita of countries over time. For more information about Gapminder, visit this [link](https://www.gapminder.org/data/).

This is a guided project: I will walk you through each step. I believe this method will prepare you for your own future projects.

In order to do this project, you may need to refer to this [tutorial](https://github.com/DrSaadLa/PythonTuts/blob/main/ML%20with%20Python/02.01.%20Linear%20Regression%20with%20Python%20(Part%2001)%20Solution.ipynb)

### Import Necessary Modules
1. import pandas
2. import numpy 
3. import seaborn
4. import matplotlib.pyplot

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline

### Import the dataset

The dataset can be downloaded from [here](https://raw.githubusercontent.com/DrSaadLa/PythonTuts/main/Data/gapminder.csv).


In [None]:
# Here is the url provided for you
url = "https://raw.githubusercontent.com/DrSaadLa/PythonTuts/main/Data/gapminder.csv"

In [None]:
# use pd.read_csv() to import the data
gapminder = pd.read_csv(url)

In [None]:
# Check the first few observations
gapminder.head()

In [None]:
# Check the last few observations
gapminder.tail()

### Checking The Data Information
The __gapminder__ dataset contains:
- 139 observations (rows)
- 10 variables (columns): 9 numerical variables and one categorical variable (__Region__)
- No missing values
- No duplicated rows
- Variables on very different scales, as the standard deviations show: __std(population) = 109,512,100, whereas std(fertility) = 1.615 and std(HIV) = 4.4__
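
The checks behind the list above can be sketched on a small stand-in table. This is only an illustration: the `toy` DataFrame and its values are made up, and only the column names `population`, `fertility`, and `Region` are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the gapminder data (values are invented for illustration)
toy = pd.DataFrame({
    "population": [34_000_000, 19_000_000, 40_000_000, 3_000_000],
    "fertility": [2.7, 6.4, 2.2, 1.4],
    "Region": ["A", "B", "C", "A"],
})

# Total number of missing cells in the whole table
n_missing = int(toy.isna().sum().sum())

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Standard deviation of each numerical column, to compare scales
stds = toy.select_dtypes(include="number").std()

print("Missing values:", n_missing)
print("Duplicated rows:", n_duplicates)
print(stds)
```

The same three calls should work on `gapminder` once it is loaded; the large gap between the standard deviations is what the last bullet refers to.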


In [None]:
# check the data information
gapminder.info()

In [None]:
# checking duplicated rows 
gapminder.duplicated().sum()

In [None]:
# Run descriptive statistics
gapminder.describe().T

## Select Target and Feature Variable

This is a simple linear regression, so we are going to use only two variables. Suppose you wish to predict life expectancy in a given country using one variable such as GDP, fertility rate, or population. 

Before selecting the candidate input variable, we will plot a heatmap of the dataset's correlation matrix, then select a variable that is highly correlated with the target, which will be __life__.

### This section is done for you. 

In [None]:
# Setting the figure size 
sns.set(rc={'figure.figsize':(10,10)})
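# Keep only the numerical columns: recent pandas versions raise an error
# when corr() meets the categorical Region column
sns.heatmap(gapminder.select_dtypes(include="number").corr() , annot = True , cmap='RdYlBu', square=True)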

From the heatmap, __fertility__ is strongly negatively correlated with __life__ (__r = -0.79__), and it is the variable you are going to use to build your model. Note that __child_mortality__ is the variable most strongly correlated with __life__ (__r = -0.87__, a strong negative relationship).

Target is: __life__

Input is: __fertility__
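
Instead of reading the values off the heatmap, the correlations with the target can also be ranked directly. The sketch below uses a made-up `toy` table (the values are invented; only the column names come from the dataset), so the printed numbers will not match the ones quoted above.

```python
import pandas as pd

# Toy stand-in for the gapminder data (values are invented for illustration)
toy = pd.DataFrame({
    "life": [75.0, 58.0, 76.0, 73.0, 81.0, 60.0],
    "fertility": [2.7, 6.4, 2.2, 1.4, 1.9, 5.5],
    "child_mortality": [20.0, 110.0, 15.0, 12.0, 5.0, 95.0],
    "Region": ["A", "B", "A", "C", "C", "B"],
})

# Pearson correlation of every numerical column with the target
corr_with_life = (
    toy.select_dtypes(include="number")
       .corr()["life"]
       .drop("life")          # the target's correlation with itself is always 1
)

# Rank by strength, ignoring the sign of the relationship
ranked = corr_with_life.abs().sort_values(ascending=False)

print(corr_with_life)
print(ranked)
```

Ranking by absolute value matters here because a strong negative correlation is just as useful for prediction as a strong positive one.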

In [None]:
# pairplot 
sns.pairplot(gapminder)

In [None]:
# Rename life as y
y = gapminder['life']

In [None]:
# Rename the input variable as X
X = gapminder['fertility']

In [None]:
# Check the shape of y
print("The shape of the target variable is :",y.shape)

In [None]:
# Check the shape of X
print("The shape of the input variable is :" , X.shape)

As we have seen in the lecture, we have to reshape a 1D array into a 2D array using the `reshape()` method.
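
As a quick illustration of what the reshape does, here is a minimal sketch on a made-up array (the values are arbitrary). The `-1` tells NumPy to infer the number of rows, and the `1` asks for a single column. In scikit-learn it is the input `X` that must be 2D, with shape `(n_samples, n_features)`; a 1D `y` is also accepted.

```python
import numpy as np

# A 1D array with four made-up fertility values
x = np.array([2.7, 6.4, 2.2, 1.4])

# -1 lets NumPy infer the number of rows; 1 is the number of columns
x_2d = x.reshape(-1, 1)

print(x.shape)     # (4,)
print(x_2d.shape)  # (4, 1)
```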

In [None]:
# reshape the target variable y
y_reshaped = np.array(y).reshape(-1,1)

In [None]:
# reshape the input variable X
X_reshaped = np.array(X).reshape(-1,1)

In [None]:
# print the new shape y
y_reshaped.shape

In [None]:
# print the new shape of X
X_reshaped.shape

### Plotting

Plot a scatter plot of the variables life and fertility.

In [None]:
# plot scatter plot 
plt.scatter(gapminder['life'] , gapminder['fertility']  )
plt.xlabel("life" , fontsize = 15)
plt.ylabel("fertility", fontsize = 15)
plt.title("Scatter Plot of Life Vs Fertility" , fontsize = 25 )

Overlay a fitted line on the plot using `lmplot` from the seaborn package.

In [None]:
# plot linear regression plot.
sns.lmplot(x ='life', y = 'fertility' , height = 10 , data = gapminder)

### Building a Linear Regression Model

1. Import LinearRegression from sklearn
2. Create an lm object
3. Fit the model
4. Print the model parameters
5. Print the score of the model
6. Predict on the same data
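
To see what these steps compute, the sketch below reproduces them by hand with NumPy for a single input variable. With one feature, ordinary least squares has a closed form, and the values correspond to what `LinearRegression` exposes as `coef_`, `intercept_`, and `score`. The arrays `x` and `y` are made-up illustration data, not the gapminder values.

```python
import numpy as np

# Made-up data: x plays the role of fertility, y the role of life expectancy
x = np.array([1.4, 1.9, 2.2, 2.7, 5.5, 6.4])
y = np.array([73.0, 81.0, 76.0, 75.0, 60.0, 58.0])

# Closed-form ordinary least squares for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Predictions on the same data
y_hat = intercept + slope * x

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print("slope:", slope)
print("intercept:", intercept)
print("R^2:", r_squared, "r^2:", r ** 2)
```

In simple linear regression the coefficient of determination equals the squared Pearson correlation, which is why a correlation of about -0.79 goes with a score of about 0.62.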

In [None]:
# Import LinearRegression() from sklearn.linear_model
from sklearn.linear_model import LinearRegression

In [None]:
# Create lm object
lm_gapminder = LinearRegression()

In [None]:
# fit the linear model
lm_gapminder.fit(X_reshaped , y_reshaped)

In [None]:
# Print the intercept 
print("The model intercept is : " , lm_gapminder.intercept_)

In [None]:
# Print the coef 
print("The model parameter is: " , lm_gapminder.coef_)

In [None]:
# Print the goodness-of-fit metric
print("The coefficient of determination is: " , lm_gapminder.score(X_reshaped , y_reshaped))
print("Fertility explains approximately 62% of the variation in life")

In [None]:
# Predict on the data
y_pred = lm_gapminder.predict(X_reshaped)

In [None]:
# Plot the fitted line on top of the scatter plot
plt.scatter(X_reshaped , y_reshaped , color = "red")
plt.plot(X_reshaped , y_pred , color = "green" , linewidth = 4)
# X holds fertility and y holds life, so label the axes accordingly
plt.xlabel("fertility" , fontsize = 15)
plt.ylabel("life", fontsize = 15)
plt.title("Comparison between the fitted line and the scatter plot of the original data" , fontsize = 25)