
Linear Regression Practice Using Python

This study involves practical application of regression analysis, as covered in the Google Advanced Data Analytics course.

For more information, you may refer to the official course materials provided by Coursera:
Google Advanced Data Analytics Certificate

Imports

Begin by importing the relevant packages and data.

# Import packages
import pandas as pd
import seaborn as sns

Next, load the dataset and inspect the first few rows using the head() function.

# Load dataset
penguins = sns.load_dataset("penguins")

# Examine first 5 rows of dataset
penguins.head()
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

From the first 5 rows of the dataset, we can see that there are several columns available: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex. There also appears to be some missing data.

Data cleaning

We know from the course materials that this dataset contains information about penguins and includes some missing values that need to be handled during data preprocessing. For this exercise, keep only the Adelie and Gentoo penguins (dropping Chinstrap) and remove any rows with missing values.

# Keep Adelie and Gentoo penguins, drop missing values
penguins_sub = penguins[penguins["species"] != "Chinstrap"]
penguins_final = penguins_sub.dropna()
penguins_final.reset_index(inplace=True, drop=True)
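
Before moving on, it can help to confirm that the cleaning steps behaved as expected. A minimal sanity check (not part of the original notebook), assuming the penguins and penguins_final variables defined above:

# Count missing values per column in the raw data
print(penguins.isna().sum())

# Confirm only Adelie and Gentoo remain, with no missing values left
print(penguins_final["species"].unique())
print(penguins_final.isna().sum().sum())  # expected output: 0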

Exploratory data analysis

Before you construct any model, it is important to get more familiar with your data. You can do so by performing exploratory data analysis (EDA). Please review previous program materials as needed if you would like to refamiliarize yourself with EDA concepts.

Since this part of the course focuses on simple linear regression, you want to check for any linear relationships among variables in the dataframe. You can do this by creating scatterplots using any data visualization package, for example matplotlib.pyplot, seaborn, or plotly.
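
For instance, a single relationship can be plotted with seaborn's scatterplot() function. A minimal sketch, assuming the penguins_final dataframe from the cleaning step:

import matplotlib.pyplot as plt

# Scatterplot of bill length against body mass
sns.scatterplot(x="bill_length_mm", y="body_mass_g", data=penguins_final)
plt.show()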

To visualize more than one relationship at the same time, we use the pairplot() function from the seaborn package to create a scatterplot matrix.

# Create pairwise scatterplots of data set
sns.pairplot(penguins_final)
[Output: scatterplot matrix (pairplot) of the numeric variables in penguins_final]

From the scatterplot matrix, you can observe a few linear relationships:

  • bill length (mm) and flipper length (mm)
  • bill length (mm) and body mass (g)
  • flipper length (mm) and body mass (g)

Model construction

Based on the above scatterplots, you could probably run a simple linear regression on any of the three relationships identified. For this part of the course, you will focus on the relationship between bill length (mm) and body mass (g).

To do this, you will first subset the variables of interest from the dataframe. You can do this by using double square brackets [[]], and listing the names of the columns of interest.

# Subset Data
ols_data = penguins_final[["bill_length_mm", "body_mass_g"]]

Next, you can construct the linear regression formula and save it as a string. Remember that the y or dependent variable comes before the ~, and the x or independent variable comes after the ~.

Note: The names of the x and y variables have to exactly match the column names in the dataframe.

# Write out formula
ols_formula = "body_mass_g ~ bill_length_mm"

Lastly, you can build the simple linear regression model in statsmodels using the ols() function. You can import the ols() function directly using the line of code below.

# Import ols function
from statsmodels.formula.api import ols

Then, you can plug in the ols_formula and ols_data as arguments in the ols() function. After you save the results as a variable, you can call on the fit() function to actually fit the model to the data.

# Build OLS, fit model to data
OLS = ols(formula = ols_formula, data = ols_data)
model = OLS.fit()

Finally, you can call the summary() function on the model object to get the coefficients and more statistics about the model. The output from model.summary() can be used to evaluate the model and interpret the results. Later in this section, we will go over how to read the results of the model output.

model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:            body_mass_g   R-squared:                       0.769
Model:                            OLS   Adj. R-squared:                  0.768
Method:                 Least Squares   F-statistic:                     874.3
Date:                Wed, 07 May 2025   Prob (F-statistic):           1.33e-85
Time:                        17:38:08   Log-Likelihood:                -1965.8
No. Observations:                 265   AIC:                             3936.
Df Residuals:                     263   BIC:                             3943.
Df Model:                           1
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept      -1707.2919    205.640     -8.302      0.000   -2112.202   -1302.382
bill_length_mm   141.1904      4.775     29.569      0.000     131.788     150.592
==============================================================================
Omnibus:                        2.060   Durbin-Watson:                   2.067
Prob(Omnibus):                  0.357   Jarque-Bera (JB):                2.103
Skew:                           0.210   Prob(JB):                        0.349
Kurtosis:                       2.882   Cond. No.                         357.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
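
Beyond the printed summary, the fitted model object also exposes these statistics programmatically, which is convenient if you want to reuse them. A short sketch using standard statsmodels attributes (not shown in the course notebook):

# Slope and intercept
print(model.params)

# R-squared and coefficient p-values
print(model.rsquared)
print(model.pvalues)

# Predict body mass for a hypothetical bill length of 45 mm
# (the value 45 is chosen purely for illustration)
print(model.predict(pd.DataFrame({"bill_length_mm": [45]})))

Using the coefficients above, this works out to roughly -1707.29 + 141.19 * 45 ≈ 4646 g.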

You can use the regplot() function from seaborn to visualize the regression line.

sns.regplot(x = "bill_length_mm", y = "body_mass_g", data = ols_data)
[Output: scatterplot of bill_length_mm vs. body_mass_g with the fitted regression line]

Finish checking model assumptions

To finish checking the model assumptions, first subset the X variable by isolating the bill_length_mm column, then use the model's predict() function to generate the fitted values.

# Subset X variable
X = ols_data["bill_length_mm"]

# Get predictions from model
fitted_values = model.predict(X)

Then, you can save the model residuals as a variable by using the model.resid attribute.

# Calculate residuals
residuals = model.resid
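
As a quick sanity check (an optional addition), model.resid should equal the observed values minus the fitted values:

# Residuals are observed minus fitted values
manual_residuals = ols_data["body_mass_g"] - fitted_values
print((residuals - manual_residuals).abs().max())  # expected to be ~0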

Check the normality assumption

To check the normality assumption, you can create a histogram of the residuals using the histplot() function from the seaborn package.

From the below histogram, you may notice that the residuals are almost normally distributed. In this case, it is likely close enough that the assumption is met.

import matplotlib.pyplot as plt
fig = sns.histplot(residuals)
fig.set_xlabel("Residual Value")
fig.set_title("Histogram of Residuals")
plt.show()


[Output: histogram of residuals]

Another way to check the normality assumption is to create a quantile-quantile or Q-Q plot. Recall that if the residuals are normally distributed, you would expect a straight diagonal line going from the bottom left to the upper right of the Q-Q plot. You can create a Q-Q plot by using the qqplot() function from the statsmodels.api package.

The Q-Q plot shows a similar pattern to the histogram, where the residuals are mostly normally distributed, except at the ends of the distribution.

import matplotlib.pyplot as plt
import statsmodels.api as sm
fig = sm.qqplot(model.resid, line = 's')
plt.show()


[Output: Q-Q plot of residuals against a normal distribution]
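
If you would like a numeric check to accompany the visual ones, the Shapiro-Wilk test from scipy is one common option. This is an optional extra beyond the course material:

from scipy import stats

# Shapiro-Wilk test: a large p-value is consistent with normally distributed residuals
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.3f}")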

Check the homoscedasticity assumption

Lastly, we have to check the homoscedasticity assumption. To check the homoscedasticity assumption, you can create a scatterplot of the fitted values and residuals. If the plot resembles a random cloud (i.e., the residuals are scattered randomly), then the assumption is likely met.

You can create one scatterplot by using the scatterplot() function from the seaborn package. The first argument is the variable that goes on the x-axis. The second argument is the variable that goes on the y-axis.

# Import matplotlib
import matplotlib.pyplot as plt
fig = sns.scatterplot(x=fitted_values, y=residuals)

# Add reference line at residuals = 0
fig.axhline(0)

# Set x-axis and y-axis labels
fig.set_xlabel("Fitted Values")
fig.set_ylabel("Residuals")

# Show the plot
plt.show()


[Output: scatterplot of fitted values vs. residuals with a horizontal line at 0]
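
The visual check can also be complemented with a formal test such as Breusch-Pagan, available in statsmodels. This is an optional extra beyond the course material:

import statsmodels.stats.api as sms

# Breusch-Pagan test: a large p-value is consistent with homoscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = sms.het_breuschpagan(residuals, model.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3f}")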

This project was developed alongside the Google Advanced Data Analytics course and focuses on practical implementation of linear regression. Leveraging the dataset and methodologies introduced in the course, it walks through data preprocessing, visualization, and model building using Python. The study complements the course content by reinforcing concepts through hands-on application.
