Linear Regression Practice Using Python

This study is a hands-on application of simple linear regression, as covered in the Google Advanced Data Analytics course.

For more information, you may refer to the official course materials provided by Coursera:
Google Advanced Data Analytics Certificate

Imports

Begin by importing the relevant packages and data.

# Import packages
import pandas as pd
import seaborn as sns

Next, load the dataset and inspect the first few rows using the head() function.

# Load dataset
penguins = sns.load_dataset("penguins")

# Examine first 5 rows of dataset
penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

From the first 5 rows of the dataset, we can see that there are several columns available: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex. There also appears to be some missing data.
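
To put a number on "some missing data", you can count the missing cells per column. The sketch below is a minimal example that rebuilds only the five rows shown in the head() output above by hand (the name `sample` is an assumption made for this illustration) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# The five rows shown in the head() output above, typed in by hand.
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "island": ["Torgersen"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
        "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
        "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
        "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
        "sex": ["Male", "Female", "Female", np.nan, "Female"],
    }
)

# isna() flags each missing cell; sum() counts the flags per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Index labels of the rows that contain at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing.index.tolist())
```

On the full dataset, the same idea is simply penguins.isna().sum().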

Data cleaning

We know from the course materials that this dataset contains information about penguins and includes some missing values that need to be handled during data preprocessing. The code below keeps only the Adelie and Gentoo penguins, drops every row that has a missing value, and resets the index so the remaining rows are numbered from 0.

# Keep Adelie and Gentoo penguins, drop missing values
penguins_sub = penguins[penguins["species"] != "Chinstrap"]
penguins_final = penguins_sub.dropna()
penguins_final.reset_index(inplace=True, drop=True)
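
To see what each of the three cleaning steps does, here is a small sketch on a toy dataframe. The rows and the names `toy`, `toy_sub`, `toy_dropped`, and `toy_final` are hypothetical and exist only for this illustration; they are not taken from the penguins dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not taken from the penguins dataset).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
        "body_mass_g": [3750.0, 3700.0, 5000.0, np.nan],
    }
)

# Step 1: the boolean mask removes every Chinstrap row.
toy_sub = toy[toy["species"] != "Chinstrap"]

# Step 2: dropna() removes rows with any missing value.
# The surviving rows still carry their original index labels.
toy_dropped = toy_sub.dropna()
print(toy_dropped.index.tolist())

# Step 3: reset_index(drop=True) renumbers the rows from 0
# and discards the old labels instead of keeping them as a column.
toy_final = toy_dropped.reset_index(drop=True)
print(toy_final.index.tolist())
```

Without drop=True, the old index would be kept as an extra column named index, which is usually not wanted in the modeling data.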

Exploratory data analysis

Before you construct any model, it is important to get more familiar with your data. You can do so by performing exploratory data analysis or EDA. Please review previous program materials as needed if you would like to refamiliarize yourself with EDA concepts.

Since this part of the course focuses on simple linear regression, you want to check for any linear relationships among variables in the dataframe. You can do this by creating scatterplots using any data visualization package, for example matplotlib.pyplot, seaborn, or plotly.

To visualize more than one relationship at the same time, you can use the pairplot() function from the seaborn package to create a scatterplot matrix.

# Create pairwise scatterplots of data set
sns.pairplot(penguins_final)

From the scatterplot matrix, you can observe a few linear relationships:

bill length (mm) and flipper length (mm)
bill length (mm) and body mass (g)
flipper length (mm) and body mass (g)
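
A scatterplot matrix is read by eye, so it can help to back it up with a number. One common option, not part of the original notebook, is the Pearson correlation coefficient, where values near +1 or -1 suggest a strong linear relationship. The sketch below uses a hypothetical helper `pairwise_correlations` and made-up toy measurements purely to show the call.

```python
import pandas as pd


def pairwise_correlations(df, columns):
    """Return the Pearson correlation matrix for the given numeric columns."""
    return df[columns].corr(method="pearson")


# Hypothetical toy measurements, used only to demonstrate the helper.
toy = pd.DataFrame(
    {
        "bill_length_mm": [36.0, 39.0, 42.0, 46.0, 50.0],
        "body_mass_g": [3400.0, 3700.0, 4100.0, 4900.0, 5300.0],
    }
)

corr = pairwise_correlations(toy, ["bill_length_mm", "body_mass_g"])
print(corr.round(3))
```

With the real data, you would pass penguins_final and the three measurement columns listed above.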

Model construction

Based on the above scatterplots, you could probably run a simple linear regression on any of the three relationships identified. For this part of the course, you will focus on the relationship between bill length (mm) and body mass (g).

To do this, you will first subset the variables of interest from the dataframe. You can do this by using double square brackets [[]], and listing the names of the columns of interest.

# Subset Data
ols_data = penguins_final[["bill_length_mm", "body_mass_g"]]
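
The difference between double and single square brackets is easy to miss, so here is a short sketch of it. The dataframe `toy` holds the first two rows from the head() output above; the variable names are assumptions made for this example.

```python
import pandas as pd

# First two rows from the head() output above, three columns only.
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, 39.5],
        "body_mass_g": [3750.0, 3800.0],
        "sex": ["Male", "Female"],
    }
)

# Double brackets: a list of column names goes in, a DataFrame comes out.
as_frame = toy[["bill_length_mm", "body_mass_g"]]

# Single brackets: one column name goes in, a Series comes out.
as_series = toy["bill_length_mm"]

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```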

Next, you can construct the linear regression formula and save it as a string. Remember that the y, or dependent, variable comes before the ~, and the x, or independent, variables come after the ~.

Note: The names of the x and y variables have to exactly match the column names in the dataframe.

# Write out formula
ols_formula = "body_mass_g ~ bill_length_mm"

With the formula defined, you can build the simple linear regression model in statsmodels using the ols() function. You can import the ols() function directly using the line of code below.

# Import ols function
from statsmodels.formula.api import ols

Then, you can pass ols_formula and ols_data as arguments to the ols() function. After you save the result as a variable, you can call its fit() method to actually fit the model to the data.

# Build OLS, fit model to data
OLS = ols(formula = ols_formula, data = ols_data)
model = OLS.fit()
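
For a single predictor, the least-squares line has a closed form, which gives a feel for what fit() estimates. The sketch below is not how statsmodels is implemented internally; it is a plain NumPy illustration with a hypothetical helper `simple_ols` and made-up toy data that lie exactly on y = 1 + 2x.

```python
import numpy as np


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


# Hypothetical toy data lying exactly on y = 1 + 2x.
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [1.0, 3.0, 5.0, 7.0]

intercept, slope = simple_ols(x_toy, y_toy)
print(intercept, slope)
```

Applied to bill_length_mm and body_mass_g, the same two formulas should reproduce the Intercept and bill_length_mm coefficients that statsmodels reports.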

Lastly, you can call the summary() function on the model object to get the coefficients and more statistics about the model. The output from model.summary() can be used to evaluate the model and interpret the results. Later in this section, we will go over how to read the results of the model output.

model.summary()

OLS Regression Results

Dep. Variable:	body_mass_g	R-squared:	0.769
Model:	OLS	Adj. R-squared:	0.768
Method:	Least Squares	F-statistic:	874.3
Date:	Wed, 07 May 2025	Prob (F-statistic):	1.33e-85
Time:	17:38:08	Log-Likelihood:	-1965.8
No. Observations:	265	AIC:	3936.
Df Residuals:	263	BIC:	3943.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-1707.2919	205.640	-8.302	0.000	-2112.202	-1302.382
bill_length_mm	141.1904	4.775	29.569	0.000	131.788	150.592

Omnibus:	2.060	Durbin-Watson:	2.067
Prob(Omnibus):	0.357	Jarque-Bera (JB):	2.103
Skew:	0.210	Prob(JB):	0.349
Kurtosis:	2.882	Cond. No.	357.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
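
As a rough guide to reading this table: the bill_length_mm coefficient says that each additional millimetre of bill length is associated with about 141 g more body mass, and the R-squared of 0.769 says the line explains about 77% of the variance in body mass. The sketch below copies three numbers from the output above and uses them for a prediction and two consistency checks. The function name `predict_body_mass` and the 40 mm example input are assumptions made for this illustration.

```python
# Values copied from the model.summary() output above.
INTERCEPT = -1707.2919
SLOPE = 141.1904
SLOPE_STD_ERR = 4.775


def predict_body_mass(bill_length_mm):
    """Predicted body mass in grams for a given bill length in millimetres."""
    return INTERCEPT + SLOPE * bill_length_mm


# Prediction for a hypothetical penguin with a 40 mm bill.
predicted_40 = predict_body_mass(40.0)

# The t statistic in the table is the coefficient divided by its standard error.
t_stat = SLOPE / SLOPE_STD_ERR

# With a single predictor, the F-statistic equals the squared t statistic.
f_stat = t_stat ** 2

print(round(predicted_40, 1), round(t_stat, 3), round(f_stat, 1))
```

The recomputed t and F values agree with the table up to rounding. The intercept itself has no physical meaning here, since a bill length of 0 mm lies far outside the observed data.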

You can use the regplot() function from seaborn to visualize the regression line.

sns.regplot(x = "bill_length_mm", y = "body_mass_g", data = ols_data)

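Finish checking model assumptions

The remaining checks, normality and homoscedasticity, both use the model's fitted values and residuals. First, subset the x variable and pass it to the model's predict() function to get the fitted values.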

# Subset X variable
X = ols_data["bill_length_mm"]

# Get predictions from model
fitted_values = model.predict(X)

Then, you can save the model residuals as a variable by using the model.resid attribute.

# Calculate residuals
residuals = model.resid
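
A residual is the observed value minus the fitted value. The sketch below shows that relationship on made-up toy data, using np.polyfit in place of statsmodels so that it runs with NumPy alone; all names in it are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data, not from the penguins dataset.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

# Fitted values are the points on the line; residuals are what is left over.
fitted_toy = intercept + slope * x_toy
residuals_toy = y_toy - fitted_toy

print(residuals_toy.round(3))

# When the model has an intercept, least-squares residuals sum to (nearly) zero.
print(abs(residuals_toy.sum()) < 1e-8)
```

For the penguin model, the equivalent check is that ols_data["body_mass_g"] - fitted_values matches model.resid.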

Check the normality assumption

To check the normality assumption, you can create a histogram of the residuals using the histplot() function from the seaborn package.

From the below histogram, you may notice that the residuals are almost normally distributed. In this case, it is likely close enough that the assumption is met.

import matplotlib.pyplot as plt

# histplot() returns a matplotlib Axes object, so it is named ax here
ax = sns.histplot(residuals)
ax.set_xlabel("Residual Value")
ax.set_title("Histogram of Residuals")
plt.show()

Another way to check the normality assumption is to create a quantile-quantile, or Q-Q, plot. Recall that if the residuals are normally distributed, you would expect the points to fall along a straight diagonal line going from the bottom left to the upper right of the Q-Q plot. You can create a Q-Q plot by using the qqplot() function from the statsmodels.api package.

The Q-Q plot shows a similar pattern to the histogram, where the residuals are mostly normally distributed, except at the ends of the distribution.

import matplotlib.pyplot as plt
import statsmodels.api as sm
fig = sm.qqplot(model.resid, line = 's')
plt.show()
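
The plots can be cross-checked against the Jarque-Bera statistic that model.summary() already reports. It combines the skew and kurtosis of the residuals, and a large p-value means there is no evidence against normality. The sketch below is a plain NumPy version written for illustration; the function names are assumptions, and the inputs in the last lines are the No. Observations, Skew, and Kurtosis values copied from the summary output.

```python
import math

import numpy as np


def jb_from_moments(n, skew, kurtosis):
    """Return (JB statistic, p-value) from the sample size, skew, and kurtosis."""
    jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # JB is compared with a chi-square distribution with 2 degrees of
    # freedom, whose upper-tail probability is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


def jarque_bera(residuals):
    """Compute skew and kurtosis from the residuals, then the JB statistic."""
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurtosis = np.mean(centered ** 4) / m2 ** 2
    return jb_from_moments(r.size, skew, kurtosis)


# No. Observations, Skew, and Kurtosis copied from the summary output.
jb, p_value = jb_from_moments(265, 0.210, 2.882)
print(round(jb, 2), round(p_value, 2))
```

The result lands close to the reported Jarque-Bera (JB) of 2.103 and Prob(JB) of 0.349, with the small gap coming from the rounded inputs. A p-value this large is consistent with what the histogram and Q-Q plot suggest.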

Check the homoscedasticity assumption

Lastly, check the homoscedasticity assumption by creating a scatterplot of the fitted values against the residuals. If the plot resembles a random cloud (i.e., the residuals are scattered randomly with roughly constant spread), then the assumption is likely met.

You can create the scatterplot by using the scatterplot() function from the seaborn package. The x argument is the variable that goes on the x-axis, and the y argument is the variable that goes on the y-axis.

# Import matplotlib
import matplotlib.pyplot as plt

# Plot residuals against fitted values (scatterplot() returns an Axes object)
ax = sns.scatterplot(x=fitted_values, y=residuals)

# Add reference line at residuals = 0
ax.axhline(0)

# Set x-axis and y-axis labels
ax.set_xlabel("Fitted Values")
ax.set_ylabel("Residuals")

# Show the plot
plt.show()
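
If the random-cloud judgment feels too subjective, a rough numeric companion is to compare the spread of the residuals for low and high fitted values. The sketch below is an informal heuristic written for this illustration, not a formal test and not part of the original notebook. The helper `spread_ratio` and the toy arrays are hypothetical. A ratio near 1 is consistent with constant spread, while a ratio far from 1 hints at a fan shape.

```python
import numpy as np


def spread_ratio(fitted, residuals):
    """Ratio of residual standard deviations: upper half vs. lower half of fitted values."""
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Sort observations by fitted value, then split them into two halves.
    order = np.argsort(fitted)
    half = fitted.size // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.std(ddof=1) / low.std(ddof=1)


# Hypothetical toy inputs: constant spread vs. spread that grows with the fit.
fitted_toy = np.arange(8.0)
even_spread = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
fanning_out = np.array([1, -1, 1, -1, 3, -3, 3, -3], dtype=float)

print(spread_ratio(fitted_toy, even_spread))
print(spread_ratio(fitted_toy, fanning_out))
```

For the penguin model, you would call spread_ratio(fitted_values, residuals). For a formal test, statsmodels offers het_breuschpagan in statsmodels.stats.diagnostic.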

This project was developed alongside the Google Advanced Data Analytics course and focuses on practical implementation of linear regression. Leveraging the dataset and methodologies introduced in the course, it walks through data preprocessing, visualization, and model building using Python. The study complements the course content by reinforcing concepts through hands-on application.
