# 50_Startups

**Linear Regression:**
XYZ is a venture capital firm in the US. It plans to invest in promising startups across the US and has already invested in around fifty startups in Florida, New York and California. You are a data analyst at XYZ and have been asked to build a model that predicts the profitability of the firm's future investments. Below is a snapshot of the data you have.

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing Pandas and NumPy
import pandas as pd, numpy as np

In [None]:
md = pd.read_csv("50_Startups.csv")

In [None]:
md.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [None]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [None]:
md.State.unique()

array(['New York', 'California', 'Florida'], dtype=object)

In [None]:
dummy = pd.get_dummies(md.State, drop_first = True)
dummy.head()

Unnamed: 0,Florida,New York
0,0,1
1,0,0
2,1,0
3,0,1
4,1,0
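
With `drop_first = True`, the first category in sorted order (here California) gets no column of its own and acts as the baseline: a row with zeros in both remaining columns is a California startup. A minimal sketch on a hypothetical three-row series; the `dtype=int` argument is an addition made here so that the output is 0/1 on pandas versions that default to booleans.

```python
import pandas as pd

# Hypothetical mini-sample of the State column, for illustration only
states = pd.Series(["New York", "California", "Florida"], name="State")

# Full one-hot encoding: one column per state
full = pd.get_dummies(states, dtype=int)

# drop_first=True removes the first category in sorted order (California),
# which then serves as the baseline level
reduced = pd.get_dummies(states, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Keeping all three columns alongside an intercept would make the columns sum to the constant, which is why one level is usually dropped before fitting a linear model.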


In [None]:
md = pd.concat([md, dummy], axis = 1)
md.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit,Florida,New York
0,165349.2,136897.8,471784.1,New York,192261.83,0,1
1,162597.7,151377.59,443898.53,California,191792.06,0,0
2,153441.51,101145.55,407934.54,Florida,191050.39,1,0
3,144372.41,118671.85,383199.62,New York,182901.99,0,1
4,142107.34,91391.77,366168.42,Florida,166187.94,1,0


In [None]:
md = md.drop(["State"], axis = 1)
md.head()

## Train-Test Split


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Fixing random_state makes the split reproducible, so the train and test
# sets contain the same rows on every run
md_train, md_test = train_test_split(md, train_size = 0.7, test_size = 0.3, 
                                     random_state = 100)
print(md_train.shape)
print(md_test.shape)

## Looking for Correlations

In [None]:
# Importing matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(md_train.corr(),annot = True)
plt.show()

In [None]:
md["Profit"].skew()

In [None]:
md_train.describe()

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

md_train[['R&D Spend','Administration','Marketing Spend', 'Profit']] = scaler.fit_transform(md_train[['R&D Spend','Administration','Marketing Spend', 'Profit']])

md_train.head()
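
The scaler above is fitted on the training rows only. As a rough NumPy-only sketch of what standardization does (the arrays are hypothetical, not taken from the dataset), each value has the training mean subtracted and is divided by the training standard deviation; any test data would reuse those same training statistics rather than being fitted again.

```python
import numpy as np

# Hypothetical training and test values for a single feature
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Statistics are learned from the training data only
mu = train.mean()
sigma = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mu) / sigma

# The test data reuses the training mean and standard deviation
test_scaled = (test - mu) / sigma

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

With scikit-learn this corresponds to calling `fit_transform` on the training set and `transform` on the test set.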

## Dividing into X and y sets for model building

In [None]:
y_train = md_train.pop('Profit')
X_train = md_train

In [None]:
y_train.head()

In [None]:
X_train.head()

## Building a linear model

Fit a regression line through the training data using statsmodels. In statsmodels you must add the intercept explicitly with `sm.add_constant(X)`; without this step, `sm.OLS` fits a regression line passing through the origin.
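
To see why the constant matters, here is a small NumPy-only sketch on synthetic data (the data, seed and variable names are assumptions for illustration). It mimics what `sm.add_constant` does by prepending a column of ones, and compares the least-squares fit with and without that column.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 5.0 + 2.0 * x + rng.normal(0, 0.5, size=200)  # true intercept 5, true slope 2

# Without a constant column the only coefficient is a slope,
# so the fitted line is forced through the origin
X_no_const = x.reshape(-1, 1)
beta_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a leading column of ones the model can also estimate an intercept
X_const = np.column_stack([np.ones_like(x), x])
beta_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("slope only:        ", beta_no_const)
print("intercept and slope:", beta_const)
```

In the slope-only fit the slope is pushed above its true value to compensate for the missing intercept, which is the kind of bias the explicit constant avoids.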

In [None]:
import statsmodels.api as sm

In [None]:
# Add a constant
X_train_lm1 = sm.add_constant(X_train)
X_train_lm1.head()

In [None]:
# Create a first fitted model
lr_1 = sm.OLS(y_train, X_train_lm1).fit()

In [None]:
print(lr_1.summary())

## Checking VIF

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
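
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes this by hand with NumPy on synthetic data; the helper `vif_by_hand` and the demo columns are hypothetical. It includes an intercept in each auxiliary regression (the textbook definition), so its numbers may differ from a `variance_inflation_factor` call on a matrix that has no constant column.

```python
import numpy as np

def vif_by_hand(X):
    """Return 1 / (1 - R^2) per column, regressing it on the other columns plus an intercept."""
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b
X_demo = np.column_stack([a, b, c])

vifs = vif_by_hand(X_demo)
print(np.round(vifs, 2))
```

The two nearly identical columns receive large VIFs while the unrelated column stays close to 1.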

## Dropping the variable and updating the model

In [None]:
# Drop 'Administration' as a highly correlated or insignificant variable
X1 = X_train.drop('Administration', axis = 1)

In [None]:
X1.head()

In [None]:
# Build a second model

X_train_lm2= sm.add_constant(X1)

# X_train_lm2.head()
lr_2 = sm.OLS(y_train, X_train_lm2).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())



In [None]:
# Drop 'Florida' as a highly correlated or insignificant variable
X2 = X1.drop('Florida', axis = 1)
X2.head()

In [None]:
# Build a third model

X_train_lm3= sm.add_constant(X2)

lr_3 = sm.OLS(y_train, X_train_lm3).fit()

In [None]:
# Print the summary of the model
print(lr_3.summary())

In [None]:
# Drop 'New York' as a highly correlated or insignificant variable
X3 = X2.drop('New York', axis = 1)
X3.head()

Build a fourth model

In [None]:

X_train_lm4= sm.add_constant(X3)

lr_4 = sm.OLS(y_train, X_train_lm4).fit()

In [None]:
# Print the summary of the model
print(lr_4.summary())

In [None]:
# Drop 'Marketing Spend' as a highly correlated or insignificant variable
X4 = X3.drop('Marketing Spend', axis = 1)
X4.head()

In [None]:
# Build a fifth model

X_train_lm5= sm.add_constant(X4)

lr_5 = sm.OLS(y_train, X_train_lm5).fit() 

In [None]:
# Print the summary of the model
print(lr_5.summary())

In [None]:
pred_train =  lr_3.predict(X_train_lm3)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - pred_train), bins = 10, kde = True, stat = "density")  # distplot is deprecated in seaborn
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label
# plt.savefig("Error")

In [None]:
# create dataframe from X, y for easier plot handling
dataframe = pd.concat([X2, y_train], axis=1)
# model values
model_fitted_y = lr_3.fittedvalues
# model residuals
model_residuals = lr_3.resid
# normalized residuals
model_norm_residuals = lr_3.get_influence().resid_studentized_internal
# square root of the absolute normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = lr_3.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = lr_3.get_influence().cooks_distance[0]
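
As background for the leverage values, the quantity exposed as `hat_matrix_diag` is the diagonal of the hat matrix H = X (XᵀX)⁻¹ Xᵀ. A minimal NumPy sketch on synthetic data (the design matrix and the planted extreme point are assumptions for illustration) shows that an observation far from the rest receives the highest leverage.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
x[0] = 8.0  # one observation placed far from the others

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(leverage.sum())     # the leverages sum to the number of columns in X
print(leverage.argmax())  # index of the highest-leverage observation
```

Each leverage lies between 0 and 1, so a value that stands well above the rest flags a point with an unusual combination of predictor values.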

In [None]:
from statsmodels.graphics.gofplots import ProbPlot
QQ = ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');
# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i,
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));

In [None]:
plt.scatter(md_train.index, (y_train - pred_train))
plt.suptitle('Pattern of Error Terms', fontsize = 20)  
plt.xlabel('Observations', fontsize = 18) 
plt.ylabel("Residuals", fontsize = 18)
plt.axhline(y=0.0, color='r', linestyle='-')
# plt.savefig("Pattern of Error Term")

In [None]:
dataframe.head()

In [None]:
plot_lm_1 = plt.figure()
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.residplot(x=model_fitted_y, y=dataframe.columns[-1], data=dataframe,
                   lowess=True,
                   scatter_kws={'alpha': 0.5},
                   line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

ax.set_title('Residuals vs Fitted')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals');

In [None]:
# create dataframe from X, y for easier plot handling
dataframe1 = pd.concat([X3, y_train], axis=1)
# model values
model_fitted_y = lr_4.fittedvalues
# model residuals
model_residuals = lr_4.resid
# normalized residuals
model_norm_residuals = lr_4.get_influence().resid_studentized_internal
# square root of the absolute normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = lr_4.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = lr_4.get_influence().cooks_distance[0]

In [None]:
pred_train2 = lr_4.predict(X_train_lm4)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - pred_train2), bins = 10, kde = True, stat = "density")  # distplot is deprecated in seaborn
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label
# plt.savefig("Error")

In [None]:
from statsmodels.graphics.gofplots import ProbPlot
QQ = ProbPlot(model_norm_residuals)
plot_lm_3 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
plot_lm_3.axes[0].set_title('Normal Q-Q')
plot_lm_3.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_3.axes[0].set_ylabel('Standardized Residuals');
# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_3.axes[0].annotate(i,
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));

In [None]:
plt.scatter(md_train.index, (y_train - pred_train2))
plt.suptitle('Pattern of Error Terms', fontsize = 20)  
plt.xlabel('Observations', fontsize = 18) 
plt.ylabel("Residuals", fontsize = 18)
plt.axhline(y=0.0, color='r', linestyle='-')
# plt.savefig("Pattern of Error Term")

In [None]:
plot_lm_1 = plt.figure()
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.residplot(x=model_fitted_y, y=dataframe1.columns[-1], data=dataframe1,
                   lowess=True,
                   scatter_kws={'alpha': 0.5},
                   line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

ax.set_title('Residuals vs Fitted')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals');

In [None]:
mar_sp_sq_neg = -(md_train["Marketing Spend"]**2)
mar_sp_sq_neg.head()

In [None]:
X2_sq = X2.drop('Marketing Spend', axis = 1)
X2_sq.head()

In [None]:
X2_sq = pd.concat([X2_sq, mar_sp_sq_neg], axis = 1)

In [None]:
X2_sq.head()