In [None]:
# Import necessary libraries
import pandas as pd
from scipy import stats

# Load the new dataset
df = pd.read_csv('keepdata.csv')

# Display the first few rows of the dataframe
df.head()

In [None]:
# Remove rows with negative rates
# (.copy() gives an independent frame, so adding the z_score column later does not warn)
df = df[df['rates'] >= 0].copy()

# Display the first few rows of the dataframe after removing negative rates
df.head()

In [None]:
# Calculate z-scores
df['z_score']=stats.zscore(df['rates'])

# Remove outliers: keep only the ones that are within +3 to -3 standard deviations in the column 'rates'.
df_no_outliers = df[(df['z_score'] > -3) & (df['z_score'] < 3)]

# Display the first few rows of the dataframe after removing outliers
df_no_outliers.head()

In [None]:
# Import necessary libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Plot ages versus rates for the original dataframe
plt.figure(figsize=(10, 6))
sns.scatterplot(x='ages', y='rates', data=df)
plt.title('Ages vs Rates (with outliers)')
plt.show()

# Box and whisker plot for the original dataframe
plt.figure(figsize=(10, 6))
sns.boxplot(x='ages', y='rates', data=df)
plt.title('Box plot of Ages vs Rates (with outliers)')
plt.show()

In [None]:
# Plot ages versus rates for the dataframe without outliers
plt.figure(figsize=(10, 6))
sns.scatterplot(x='ages', y='rates', data=df_no_outliers)
plt.title('Ages vs Rates (without outliers)')
plt.show()

# Box and whisker plot for the dataframe without outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x='ages', y='rates', data=df_no_outliers)
plt.title('Box plot of Ages vs Rates (without outliers)')
plt.show()

In [None]:
# Import necessary libraries for linear regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define the predictor variable and target variable
X = df_no_outliers[['ages']]
y = df_no_outliers['rates']

# Split the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test)

# The coefficients
print('Coefficients: \n', model.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))

# Plot outputs
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)

plt.show()

In [None]:
# Calculate residuals
residuals = y_test - y_pred

# Scatter plot of predicted vs. residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals)
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()

# Histogram of residuals
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.show()

## Model Evaluation

The linear regression model was trained using 'ages' as the predictor variable and 'rates' as the target variable. The model's performance was evaluated using the Mean Squared Error (MSE) and the Coefficient of Determination (R^2 score).

The MSE of the model is 8902.91. The MSE measures the average squared difference between the actual and predicted values, with lower values indicating better fit. However, the MSE is highly dependent on the scale of the target variable, and therefore it can be difficult to interpret in isolation.
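One way to make the MSE easier to read is to take its square root (the RMSE, which is in the same units as 'rates') and to compare it with a baseline that always predicts the mean. The sketch below uses small made-up arrays rather than the notebook's data; `y_true` and `y_hat` are illustrative names only.

```python
import numpy as np

# Made-up values for illustration only (not taken from keepdata.csv)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

mse = np.mean((y_true - y_hat) ** 2)   # average squared error
rmse = np.sqrt(mse)                    # same units as the target

# Baseline: always predict the mean of the observed values
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

# Rescaling the target by 10 multiplies the MSE by 100 but the RMSE by only 10
mse_scaled = np.mean((10 * y_true - 10 * y_hat) ** 2)

print(mse, rmse, baseline_mse, mse_scaled)
```

An MSE well below the baseline MSE indicates that the model is doing better than a constant prediction, whatever the scale of the target.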

The R^2 score of the model is 0.06. R^2 is the proportion of the variance in the target variable that the model explains: 1 indicates perfect prediction, and 0 means the model does no better than always predicting the mean. When it is scored on held-out data, as here, it can even be negative. A score of 0.06 means that 'ages' accounts for only about 6% of the variance in 'rates' on the test set, so the model has very little predictive power. This could be due to a non-linear relationship between 'ages' and 'rates', or to other variables not included in the model influencing 'rates'.
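Computed from its definition, R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, so on held-out data it falls below 0 whenever a model does worse than predicting the mean. A minimal sketch with toy numbers follows; `r_squared` is a hypothetical helper written for illustration, using the same definition as scikit-learn's `r2_score`.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                    # predictions equal the data
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0]) # worse than the mean
print(perfect, mean_only, reversed_fit)
```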

The residuals plot and the histogram of residuals were also examined. The residuals plot did not show any clear pattern, which is a good sign because it suggests that the errors are not systematically related to the predicted value. The histogram showed that the residuals are not perfectly normally distributed. Approximate normality of the residuals is assumed when computing confidence intervals and p-values for a linear model, though not for the least-squares fit itself.
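A histogram can be backed up with a numerical check. The sketch below applies the Shapiro-Wilk test from `scipy.stats` to two stand-in residual samples generated for illustration; in the notebook, the input would be `y_test - y_pred`. The sample sizes and distributions are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in residuals (not the notebook's): one roughly normal, one clearly skewed
samples = {
    "normal": rng.normal(0, 1, size=200),
    "skewed": rng.exponential(1.0, size=200) - 1.0,
}

p_values = {}
for name, resid in samples.items():
    stat, p_value = stats.shapiro(resid)  # small p-value: evidence against normality
    p_values[name] = p_value
    print(f"{name}: W={stat:.3f}, p={p_value:.4f}, skew={stats.skew(resid):.2f}")
```

A small p-value only says that the residuals depart from normality; with large samples even minor departures are flagged, so the histogram remains a useful companion.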

In conclusion, while the linear regression model provides some insight into the relationship between 'ages' and 'rates', its predictive power is limited. Further investigation and potentially more complex modeling techniques may be required to accurately predict 'rates' from 'ages'.

# start extract


In [None]:
# Import necessary libraries for polynomial regression
from sklearn.preprocessing import PolynomialFeatures

# Define the degree of the polynomial features
degree = 2

# Create a PolynomialFeatures object
poly_features = PolynomialFeatures(degree=degree)

# Transform the predictor variable
X_poly = poly_features.fit_transform(X)

# Split the data into training set and test set
X_train_poly, X_test_poly, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it to the polynomial features training data
model_poly = LinearRegression()
model_poly.fit(X_train_poly, y_train)

# Make predictions using the test set
y_pred_poly = model_poly.predict(X_test_poly)

# The coefficients
print('Coefficients: \n', model_poly.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred_poly))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred_poly))

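# Plot outputs
# (column 1 of the polynomial features holds the original ages; column 0 is the constant term)
plt.scatter(X_test_poly[:, 1], y_test, color='black')
plt.scatter(X_test_poly[:, 1], y_pred_poly, color='blue')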
plt.title('Polynomial Regression')
plt.xlabel('Ages')
plt.ylabel('Rates')
plt.show()

In [None]:
# Plot outputs
plt.scatter(X_test_poly[:, 1], y_test,  color='black')
plt.scatter(X_test_poly[:, 1], y_pred_poly, color='blue')
plt.title('Polynomial Regression')
plt.xlabel('Ages')
plt.ylabel('Rates')
plt.show()

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate ages from 16 to 75
ages = np.random.randint(16, 76, size=100)

# Generate rates that linearly decrease with age with some random noise
rates = 100 - ages + np.random.normal(0, 10, size=100)

# Create a dataframe
df_linear = pd.DataFrame({'ages': ages, 'rates': rates})

# Write the dataframe to a csv file
df_linear.to_csv('linear_data.csv', index=False)

df_linear.head()

In [None]:
# Load the new data
df_linear = pd.read_csv('linear_data.csv')

# Display the first few rows of the dataframe
df_linear.head()

In [None]:
# Drop negative values
df_linear = df_linear[df_linear['rates'] >= 0]

# Drop null values
df_linear = df_linear.dropna()

# Display the first few rows of the dataframe
df_linear.head()

In [None]:
# Identify outliers
Q1 = df_linear.quantile(0.25)
Q3 = df_linear.quantile(0.75)
IQR = Q3 - Q1

# Remove outliers
df_linear_no_outliers = df_linear[~((df_linear < (Q1 - 1.5 * IQR)) | (df_linear > (Q3 + 1.5 * IQR))).any(axis=1)]

# Display the first few rows of the dataframe without outliers
df_linear_no_outliers.head()

In [None]:
# Plot ages versus rates for the original dataframe
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(df_linear['ages'], df_linear['rates'])
plt.title('Original Data')
plt.xlabel('Ages')
plt.ylabel('Rates')

# Plot ages versus rates for the dataframe without outliers
plt.subplot(1, 2, 2)
plt.scatter(df_linear_no_outliers['ages'], df_linear_no_outliers['rates'])
plt.title('Data Without Outliers')
plt.xlabel('Ages')
plt.ylabel('Rates')

plt.tight_layout()
plt.show()

# Box and whisker plots
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
df_linear.boxplot(column=['rates'])
plt.title('Original Data')

plt.subplot(1, 2, 2)
df_linear_no_outliers.boxplot(column=['rates'])
plt.title('Data Without Outliers')

plt.tight_layout()
plt.show()

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define predictor and target variables
X = df_linear_no_outliers[['ages']]
y = df_linear_no_outliers['rates']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Print the coefficients
print('Coefficients:', regr.coef_)

# Print the mean squared error
print('Mean squared error:', mean_squared_error(y_test, y_pred))

# Print the coefficient of determination (R^2 score)
print('Coefficient of determination (R^2 score):', r2_score(y_test, y_pred))

In [None]:
# Plot scatter plot of the test data and the predicted regression line
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.title('Linear Regression Model')
plt.xlabel('Ages')
plt.ylabel('Rates')
plt.show()

# Plot residuals
plt.scatter(y_pred, y_test - y_pred, color='black')
plt.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), color='blue')
plt.title('Residuals')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()

## Model Evaluation

The linear regression model has a coefficient of -0.9446: each additional year of age lowers the predicted rate by about 0.94. This is close to the slope of -1 used to generate the data.
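Because this dataset is synthetic, the fitted slope can be checked against the generating formula `rates = 100 - ages + noise`. The sketch below regenerates similar data and fits a line with `np.polyfit`; it uses a different random generator from the notebook and skips the cleaning and train/test split, so its numbers will not match the ones reported here exactly.

```python
import numpy as np

# Assumption: default_rng(42) is not the same stream as the notebook's np.random.seed(42)
rng = np.random.default_rng(42)
ages = rng.integers(16, 76, size=100)
rates = 100 - ages + rng.normal(0, 10, size=100)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(ages, rates, deg=1)
print(f"slope={slope:.3f} (true value -1), intercept={intercept:.1f} (true value 100)")
```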

The mean squared error of the model is 75.25. This is a measure of the average squared difference between the actual and predicted values, with lower values indicating a better fit of the model to the data.

The coefficient of determination (R^2 score) is 0.793. A score of 1 indicates perfect prediction, and 0 indicates a model that does no better than predicting the mean. A score of 0.793 means that approximately 79.3% of the variability in rates on the test set is explained by age, which suggests a strong relationship.

The scatter plot of the test data and the predicted regression line shows a clear negative linear relationship, which is what we would expect given the negative coefficient of the model.

The residuals plot shows how the prediction errors (residuals) are distributed. Ideally, we would like to see a random distribution of residuals around the horizontal axis. In this case, the residuals appear to be randomly distributed around zero, suggesting that a linear model is appropriate for this data.

# end extract

In [None]:
# Import necessary libraries
from sklearn.preprocessing import PolynomialFeatures

# Create a PolynomialFeatures object
poly = PolynomialFeatures(degree=2)

# Transform the x data for polynomial regression
X_poly = poly.fit_transform(X)

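# Fit the polynomial regression model on the full cleaned dataset
# (no train/test split is used in this cell)
poly_regr = LinearRegression()
poly_regr.fit(X_poly, y)

# Make predictions on the same data the model was fitted to,
# so the metrics below are in-sample
y_poly_pred = poly_regr.predict(X_poly)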

# Print the coefficients
print('Coefficients:', poly_regr.coef_)

# Print the mean squared error
print('Mean squared error:', mean_squared_error(y, y_poly_pred))

# Print the coefficient of determination (R^2 score)
print('Coefficient of determination (R^2 score):', r2_score(y, y_poly_pred))

In [None]:
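# Plot scatter plot of the data and the predicted polynomial regression curve
# (sort by age first so the line is drawn left to right instead of zigzagging)
order = np.argsort(X['ages'].to_numpy())
plt.scatter(X['ages'], y, color='black')
plt.plot(X['ages'].to_numpy()[order], y_poly_pred[order], color='blue', linewidth=3)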
plt.title('Polynomial Regression Model')
plt.xlabel('Ages')
plt.ylabel('Rates')
plt.show()

# Plot residuals
plt.scatter(y_poly_pred, y - y_poly_pred, color='black')
plt.hlines(y=0, xmin=y_poly_pred.min(), xmax=y_poly_pred.max(), color='blue')
plt.title('Residuals')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()

## Model Evaluation

The polynomial regression model has coefficients of -0.9605 and 0.0000945 for the age and age-squared terms, respectively. The quadratic coefficient is close to zero (at age 75 it contributes only about 0.5 to the prediction), so the fitted curve is almost a straight line, as expected for data generated from a linear function.

The mean squared error of the model is 96.88. This is a measure of the average squared difference between the actual and predicted values, with lower values indicating a better fit of the model to the data. It is higher than the linear model's 75.25, but the two figures are not directly comparable: the polynomial model was scored on the full dataset it was fitted to, whereas the linear model was scored on a 20% test split.

The coefficient of determination (R^2 score) is 0.751, meaning that approximately 75.1% of the variability in rates is explained by the age and age-squared terms, which suggests a strong relationship. This is slightly lower than the linear model's 0.793, but, as with the MSE, the two scores were computed on different sets of rows, so the gap by itself does not show that the polynomial model fits worse.
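For the linear and polynomial scores to be comparable, both models need to be fitted on the same training rows and scored on the same test rows. The sketch below shows one way to do that with a scikit-learn pipeline. It regenerates stand-in data from the linear formula with a different random generator, so its output will differ from the figures quoted in this section.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data (assumption): same formula as linear_data.csv, different RNG stream
rng = np.random.default_rng(42)
ages = rng.integers(16, 76, size=100).reshape(-1, 1)
rates = 100 - ages.ravel() + rng.normal(0, 10, size=100)

# One split shared by both models
X_tr, X_te, y_tr, y_te = train_test_split(ages, rates, test_size=0.2, random_state=42)

results = {}
for degree in (1, 2):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[degree] = (mean_squared_error(y_te, pred), r2_score(y_te, pred))
    print(f"degree {degree}: MSE={results[degree][0]:.2f}, R^2={results[degree][1]:.3f}")
```

Wrapping `PolynomialFeatures` and `LinearRegression` in one pipeline also keeps the feature transform and the model together, so the same object can be used for prediction.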

The scatter plot of the data and the predicted polynomial regression line shows a clear negative relationship, consistent with the negative coefficient on the age term. Given the near-zero quadratic coefficient, the fitted curve should show almost no curvature, which is what we would expect for data generated from a linear function.

The residuals plot shows how the prediction errors (residuals) are distributed. Ideally, we would like to see a random distribution of residuals around the horizontal axis. In this case, the residuals appear to be randomly distributed around zero, so the polynomial model is adequate for this data, although it offers nothing beyond the simpler linear model.

In [None]:
# Import necessary libraries
import numpy as np

# Set the seed for reproducibility
np.random.seed(42)

# Generate ages from 16 to 75
ages = np.random.randint(16, 76, 1000)

# Generate rates based on a quadratic polynomial
rates = 0.01 * ages**2 - 2 * ages + 100 + np.random.normal(0, 10, 1000)

# Create a dataframe
df_quadratic = pd.DataFrame({'ages': ages, 'rates': rates})

# Write the dataframe to a CSV file
df_quadratic.to_csv('quadratic_data.csv', index=False)

# Display the first few rows of the dataframe
df_quadratic.head()

In [None]:
# Drop negative values
df_quadratic = df_quadratic[df_quadratic['rates'] >= 0]

# Drop null values
df_quadratic = df_quadratic.dropna()

# Display the first few rows of the dataframe
df_quadratic.head()

In [None]:
# Import necessary libraries
from scipy import stats

# Calculate z-scores
z_scores = stats.zscore(df_quadratic)

# Get absolute z-scores
abs_z_scores = np.abs(z_scores)

# Create a boolean array indicating where the z-score is less than 3
filtered_entries = (abs_z_scores < 3).all(axis=1)

# Create a new dataframe with outliers removed
df_quadratic_no_outliers = df_quadratic[filtered_entries]

# Display the first few rows of the new dataframe
df_quadratic_no_outliers.head()

In [None]:
# Import necessary libraries
import seaborn as sns

# Plot ages versus rates for the original dataframe
sns.boxplot(x='ages', y='rates', data=df_quadratic)
plt.title('Boxplot of Rates by Age (Original Data)')
plt.show()

# Plot ages versus rates for the new dataframe
sns.boxplot(x='ages', y='rates', data=df_quadratic_no_outliers)
plt.title('Boxplot of Rates by Age (Outliers Removed)')
plt.show()

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define the feature and target variables
X = df_quadratic_no_outliers[['ages']]
y = df_quadratic_no_outliers['rates']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Print the coefficients
print('Coefficients:', regr.coef_)

# Print the mean squared error
print('Mean squared error:', mean_squared_error(y_test, y_pred))

# Print the coefficient of determination (R^2 score)
print('Coefficient of determination (R^2 score):', r2_score(y_test, y_pred))

In [None]:
# Plot scatter plot of the data and the regression line
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.title('Scatter Plot of Data and Regression Line')
plt.xlabel('Ages')
plt.ylabel('Rates')
plt.show()

# Plot residuals
residuals = y_test - y_pred
plt.scatter(X_test, residuals, color='black')
plt.title('Residuals Plot')
plt.xlabel('Ages')
plt.ylabel('Residuals')
plt.show()

The linear regression model has a coefficient of determination (R^2 score) of approximately 0.73, which indicates that about 73% of the variability in the 'rates' can be explained by the 'ages'. This is a relatively high R^2 score, suggesting that the model fits the data quite well.

However, it's important to note that the mean squared error (MSE) is approximately 104.54. The MSE is always non-negative, and values closer to zero are better. Here, though, the noise added when generating the data has a variance of 10^2 = 100, so no model can be expected to score much below roughly 100. An MSE of 104.54 therefore leaves only a small margin for improvement.
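For synthetic data, the MSE can be put in context by scoring the generating curve itself: its error is the noise alone, which sets a floor for any fitted model. The sketch below assumes the first quadratic formula used in this notebook and a large sample drawn from a different random generator, and it ignores the removal of negative rates.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = rng.integers(16, 76, size=100_000)
noise = rng.normal(0, 10, size=ages.size)

# The curve used to generate the data, and the noisy observations around it
true_curve = 0.01 * ages**2 - 2 * ages + 100
rates = true_curve + noise

# Scoring the true curve leaves only the noise, whose variance is 10**2 = 100
mse_of_truth = np.mean((rates - true_curve) ** 2)
print(f"MSE of the true curve: {mse_of_truth:.2f}")
```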

Looking at the scatter plot of the data and the regression line, we can see that the model captures the general trend in the data, but quite a few points still lie far from the line. In the residuals plot, the residuals are not randomly scattered around the horizontal axis, as they would be if the model captured all of the structure in the data. This points to curvature that a straight line cannot represent.

Overall, the linear regression model provides a decent fit to the data, but there is still room for improvement. It's possible that a more complex model, such as a polynomial regression model, might provide a better fit to this data.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Set the seed for reproducibility
np.random.seed(42)

# Generate ages from 16 to 75
ages = np.random.randint(16, 76, 1000)

# Generate rates based on a quadratic function of ages
rates = -0.01 * ages**2 + 1.5 * ages + np.random.normal(0, 10, 1000)

# Create a dataframe
df_quadratic = pd.DataFrame({'ages': ages, 'rates': rates})

# Write the dataframe to a CSV file
df_quadratic.to_csv('quadratic_data.csv', index=False)

In [None]:
# Load the new data
df_quadratic = pd.read_csv('quadratic_data.csv')

# Display the first few rows of the dataframe
df_quadratic.head()

In [None]:
# Import necessary libraries
from sklearn.preprocessing import PolynomialFeatures

# Create a PolynomialFeatures object with degree 2
poly = PolynomialFeatures(degree=2)

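# Transform the feature variable
# (X and y are still the ones defined from df_quadratic_no_outliers in the earlier linear regression cell)
X_poly = poly.fit_transform(X)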

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Create a linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Print the coefficients
print('Coefficients:', regr.coef_)

# Print the mean squared error
print('Mean squared error:', mean_squared_error(y_test, y_pred))

# Print the coefficient of determination (R^2 score)
print('Coefficient of determination (R^2 score):', r2_score(y_test, y_pred))

In [None]:
# Import necessary libraries
from scipy import stats

# Remove outliers (keep rows where every column is within 3 standard deviations of its mean)
df_quadratic_no_outliers = df_quadratic[(np.abs(stats.zscore(df_quadratic)) < 3).all(axis=1)]

# Keep only rows with strictly positive rates
df_quadratic_no_outliers = df_quadratic_no_outliers[df_quadratic_no_outliers['rates'] > 0]

# Display the first few rows of the dataframe without outliers
df_quadratic_no_outliers.head()

In [None]:
# Plot scatter plot of the data and the regression line
plt.scatter(X_test[:, 1], y_test, color='black')
plt.scatter(X_test[:, 1], y_pred, color='blue', linewidth=3)
plt.title('Scatter Plot of Data and Regression Line')
plt.xlabel('Ages')
plt.ylabel('Rates')
plt.show()

# Plot residuals
residuals = y_test - y_pred
plt.scatter(X_test[:, 1], residuals, color='black')
plt.title('Residuals Plot')
plt.xlabel('Ages')
plt.ylabel('Residuals')
plt.show()

The polynomial regression model has a coefficient of determination (R^2 score) of approximately 0.75, which indicates that about 75% of the variability in the 'rates' can be explained by the 'ages'. This is a slight improvement over the linear regression model, which had an R^2 score of approximately 0.73.

The mean squared error (MSE) of the polynomial regression model is approximately 95.74, which is lower than the MSE of the linear regression model (104.54). This suggests that the polynomial regression model provides a better fit to the data.

Looking at the scatter plot of the data and the regression line, we can see that the polynomial regression model captures the quadratic trend in the data more accurately than the linear regression model. This is also reflected in the residuals plot, where the residuals are more randomly scattered around the horizontal axis, indicating a better fit to the data.

Overall, the polynomial regression model provides a better fit to the data than the linear regression model, as expected given that the data was generated using a quadratic polynomial.

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt

# Plot ages versus rates for the old dataframe showing outliers
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df_quadratic['ages'], df_quadratic['rates'], color='black')
plt.title('Ages vs Rates (with outliers)')
plt.xlabel('Ages')
plt.ylabel('Rates')

# Plot box and whisker plots for the old dataframe showing outliers
plt.subplot(1, 2, 2)
plt.boxplot([df_quadratic['ages'], df_quadratic['rates']])
plt.title('Box and Whisker Plots (with outliers)')
plt.xticks([1, 2], ['Ages', 'Rates'])
plt.show()

# Plot ages versus rates for the new dataframe without outliers
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df_quadratic_no_outliers['ages'], df_quadratic_no_outliers['rates'], color='black')
plt.title('Ages vs Rates (without outliers)')
plt.xlabel('Ages')
plt.ylabel('Rates')

# Plot box and whisker plots for the new dataframe without outliers
plt.subplot(1, 2, 2)
plt.boxplot([df_quadratic_no_outliers['ages'], df_quadratic_no_outliers['rates']])
plt.title('Box and Whisker Plots (without outliers)')
plt.xticks([1, 2], ['Ages', 'Rates'])
plt.show()

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_quadratic_no_outliers[['ages']], df_quadratic_no_outliers['rates'], test_size=0.2, random_state=42)

# Create a linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))

In [None]:
# Plot scatter of test data and predictions
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.title('Scatter Plot of Test Data and Predictions')
plt.xlabel('Ages')
plt.ylabel('Rates')

# Plot residuals
plt.subplot(1, 2, 2)
plt.scatter(y_pred, y_test - y_pred, color='black')  # residual = actual - predicted, as in the earlier cells
plt.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), color='blue')
plt.title('Residuals')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()

## Evaluation of the Linear Regression Model

The coefficient of the linear regression model is approximately 0.57. This means that for each one-year increase in age, the predicted rate increases by about 0.57.

The mean squared error (MSE) of the model is approximately 119.34. The MSE is always non-negative, and values closer to zero are better.

The coefficient of determination, denoted R² and pronounced "R squared", is approximately 0.53. It measures the proportion of the variance in the dependent variable that is explained by the independent variable or variables in a regression model. An R² of 1 (100 percent) indicates that the model's predictions match the observed values exactly. In this case, about 53% of the variability in rates on the test set is explained by age, which is not very high.

From the scatter plot of the test data and predictions, we can see that the model does not fit the data perfectly. There is a clear pattern in the residuals plot, indicating that the model is not capturing some aspect of the data. This is expected as we know the data is generated from a quadratic function, and a linear model may not be the best fit for this data.
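A pattern in the residuals can also be checked numerically. Least-squares residuals are uncorrelated with the predictor by construction, so the sketch below correlates them with a curvature term (centred age squared) instead. It regenerates stand-in data from the second quadratic formula with a different random generator and skips the cleaning steps, so treat it as an illustration rather than a reproduction of the results above.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(16, 76, size=1000).astype(float)
rates = -0.01 * ages**2 + 1.5 * ages + rng.normal(0, 10, size=1000)

# Straight-line fit and its residuals (actual minus predicted)
slope, intercept = np.polyfit(ages, rates, deg=1)
residuals = rates - (slope * ages + intercept)

# Correlation with age is ~0 by construction; correlation with curvature is not
curvature = (ages - ages.mean()) ** 2
corr_age = np.corrcoef(ages, residuals)[0, 1]
corr_curvature = np.corrcoef(curvature, residuals)[0, 1]
print(f"corr with ages: {corr_age:.3f}, corr with centred ages^2: {corr_curvature:.3f}")
```

A clearly non-zero correlation with the curvature term is a sign that a quadratic term is missing from the model.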

In conclusion, the linear regression model is not a good fit for the data. A polynomial regression model may be a better choice for this data.

In [None]:
# Import necessary libraries
from sklearn.preprocessing import PolynomialFeatures

# Create a PolynomialFeatures object with degree 2
poly = PolynomialFeatures(degree=2)

# Transform the x data for polynomial regression
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Create a linear regression object
poly_regr = LinearRegression()

# Train the model using the training sets
poly_regr.fit(X_train_poly, y_train)

# Make predictions using the testing set
y_pred_poly = poly_regr.predict(X_test_poly)

# The coefficients
print('Coefficients: \n', poly_regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred_poly))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred_poly))

In [None]:
# Plot scatter of test data and predictions
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X_test, y_test,  color='black')
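# Sort by age so the fitted curve is drawn left to right instead of zigzagging
order = np.argsort(X_test['ages'].to_numpy())
plt.plot(X_test['ages'].to_numpy()[order], y_pred_poly[order], color='blue', linewidth=3)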
plt.title('Scatter Plot of Test Data and Polynomial Predictions')
plt.xlabel('Ages')
plt.ylabel('Rates')

# Plot residuals
plt.subplot(1, 2, 2)
plt.scatter(y_pred_poly, y_test - y_pred_poly, color='black')  # residual = actual - predicted
plt.hlines(y=0, xmin=y_pred_poly.min(), xmax=y_pred_poly.max(), color='blue')
plt.title('Residuals of Polynomial Regression')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()

## Evaluation of the Polynomial Regression Model

The coefficients of the polynomial regression model are approximately 1.78 for the linear term and -0.013 for the quadratic term. Because of the quadratic term, the effect of age is not constant: the predicted rate changes by about 1.78 - 0.026 × age per additional year, so it rises steeply at younger ages and flattens out towards the top of the age range. The signs match the function used to generate the data (1.5 for the linear term and -0.01 for the quadratic term), although the estimates differ somewhat from those values.
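For a quadratic fit, the change in the predicted rate per extra year of age is the derivative b1 + 2 × b2 × age rather than b1 alone. The short calculation below uses the rounded coefficients quoted in this section, so its results are approximate; `marginal_effect` is a hypothetical helper named for illustration.

```python
# Rounded coefficients as reported in this section
b1 = 1.78     # linear term
b2 = -0.013   # quadratic term

def marginal_effect(age):
    """Derivative of b1*age + b2*age**2 with respect to age."""
    return b1 + 2 * b2 * age

# Effect of one more year of age at the bottom, middle, and top of the age range
effects = {age: marginal_effect(age) for age in (16, 45, 75)}
for age, effect in effects.items():
    print(age, round(effect, 3))

# Age at which the fitted curve peaks (where the derivative is zero)
turning_point = -b1 / (2 * b2)
print(round(turning_point, 1))
```

The same formula applied to the generating coefficients (1.5 and -0.01) puts the peak at age 75, the top of the simulated range.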

The mean squared error (MSE) of the model is approximately 113.48. The MSE is always non-negative, and values closer to zero are better. This is slightly lower than the linear regression model's 119.34, indicating that the polynomial regression model is a better fit for the data.

The coefficient of determination, denoted R² and pronounced "R squared", is approximately 0.55. It measures the proportion of the variance in the dependent variable that is explained by the model. An R² of 1 (100 percent) indicates that the model's predictions match the observed values exactly. In this case, about 55% of the variability in rates on the test set is explained by age, which is slightly higher than the R² of 0.53 for the linear regression model.

From the scatter plot of the test data and polynomial predictions, we can see that the model fits the data better than the linear regression model. The residuals plot also shows less pattern than the linear regression model, indicating that the polynomial regression model is capturing more of the data's structure.

In conclusion, the polynomial regression model is a better fit for the data than the linear regression model.

# Data Analysis Notebook

This notebook contains a series of data analyses performed on different datasets. The datasets are generated to follow specific patterns (linear, quadratic) and the goal is to fit appropriate models to these data and evaluate their performance.

## Table of Contents

1. [Linear Data Analysis](#linear-data-analysis)
2. [Quadratic Data Analysis](#quadratic-data-analysis)

Each section includes data loading, data cleaning, exploratory data analysis, model fitting, model evaluation, and visualizations.

## Linear Data Analysis <a name="linear-data-analysis"></a>

In this section, we analyze a dataset that follows a linear pattern. The steps include:

1. **Data Loading**: Load the data from a CSV file into a pandas DataFrame.
2. **Data Cleaning**: Remove outliers and negative values from the data.
3. **Exploratory Data Analysis**: Generate descriptive statistics and visualizations to understand the data.
4. **Model Fitting**: Fit a linear regression model to the data.
5. **Model Evaluation**: Evaluate the performance of the model using metrics such as mean squared error and R-squared.
6. **Visualizations**: Generate plots of the data and the model predictions.
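The outlier removal in the Data Cleaning step of this section uses the interquartile range (IQR) rule. A minimal sketch of that rule on a toy series follows; `iqr_bounds` is a hypothetical helper and the values are made up for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obvious outlier
rates = pd.Series([10, 12, 13, 15, 16, 18, 95])
low, high = iqr_bounds(rates)

# Keep values inside the fences; anything outside is treated as an outlier
kept = rates[(rates >= low) & (rates <= high)]
print(low, high, kept.tolist())
```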

## Quadratic Data Analysis <a name="quadratic-data-analysis"></a>

In this section, we analyze a dataset that follows a quadratic pattern. The steps include:

1. **Data Loading**: Load the data from a CSV file into a pandas DataFrame.
2. **Data Cleaning**: Remove outliers and negative values from the data.
3. **Exploratory Data Analysis**: Generate descriptive statistics and visualizations to understand the data.
4. **Model Fitting**: Fit a polynomial regression model to the data.
5. **Model Evaluation**: Evaluate the performance of the model using metrics such as mean squared error and R-squared.
6. **Visualizations**: Generate plots of the data and the model predictions.
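The steps above can be sketched as two small functions, one for cleaning and one for fitting and scoring. This is an illustrative outline under stated assumptions, not the notebook's exact code: `clean` and `fit_and_score` are hypothetical names, the z-score filter is applied to 'rates' only, and the data are regenerated in memory instead of being read from `quadratic_data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def clean(df, z_max=3.0):
    """Drop nulls, negative rates, and rows whose rate z-score is z_max or more."""
    df = df.dropna()
    df = df[df["rates"] >= 0]
    z = (df["rates"] - df["rates"].mean()) / df["rates"].std(ddof=0)
    return df[z.abs() < z_max].copy()


def fit_and_score(df, degree, seed=42):
    """Fit a polynomial of the given degree on a train split; score on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[["ages"]], df["rates"], test_size=0.2, random_state=seed
    )
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {"mse": mean_squared_error(y_te, pred), "r2": r2_score(y_te, pred)}


# Stand-in for pd.read_csv('quadratic_data.csv'): same formula, different RNG stream
rng = np.random.default_rng(0)
ages = rng.integers(16, 76, size=1000)
raw = pd.DataFrame(
    {"ages": ages, "rates": -0.01 * ages**2 + 1.5 * ages + rng.normal(0, 10, 1000)}
)

tidy = clean(raw)
scores = {degree: fit_and_score(tidy, degree) for degree in (1, 2)}
print(len(raw), len(tidy), scores)
```

Passing `degree=1` gives the linear model, so the same function covers both sections and guarantees that the two models are scored on the same test rows.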