# Linear Regression

Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable

This week, your task involves conducting multi-class linear regression on batsmen salaries. You'll use the average runs scored per game and the strike rate as independent variables. The goal is to predict the salary as the dependent variable. Additionally, you'll be categorizing the data based on the years.

The dataset is Data_Mendeley.csv given on GitHub. Feel free to create any new functions required.

In [32]:
#import important libraries
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

preparing data

In [33]:
#mounting gdrive
from google.colab import drive
drive.mount('/content/gdrive')

df=pd.read_csv('/content/gdrive/MyDrive/Classroom/Assignment/Data_Mendeley.csv')
df.columns

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Index(['Id', 'Name', 'Year', 'Final Price', 'Role', 'Nationality', 'Team',
       'Ent', 'Age', 'Matches', 'LMatches', 'Runs', 'LRuns', 'HS', 'LHS',
       'Ave', 'LAve', 'StrRate', 'LStrRate', 'Fifties', 'LFifties', 'Hundreds',
       'LHundreds', 'Fours', 'LFours', 'Sixes', 'LSixes', 'Catches',
       'LCatches', 'Stumps', 'LStumps', 'Wkts', 'LWkts', 'Econ', 'LEcon',
       'FourWkts', 'LFourWkts', 'FiveWkts', 'LFiveWkts', 'Indian',
       'Specialist', 'Status'],
      dtype='object')

In [34]:
x=df[['Runs','StrRate']].copy()
Y=df['Final Price'].values
df['Year'] = df['Year'].astype('category') #categorising the data
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.3, random_state=1234)

Forward pass

In [None]:
def forward(x):
  pass

Mean Squared Loss

In [None]:
def loss(y,y_pred): #Mean Squared Loss
  pass

Implement Linear regression here :)

In [35]:
Y_train=Y_train.reshape(x_train.shape[0],1)
Y_test=Y_test.reshape(x_test.shape[0],1)

In [36]:
def linmodel(X, Y, alpha, iterations,lambda_):
    m, n = X.shape
    w = np.random.rand(n, 1)
    b = 0
    cost_history = []

    for i in range(iterations):
        f_wb = np.dot(X, w) + b

        # Mean Squared Error cost function
        cost = np.sum((f_wb-Y) ** 2) / (2 * m) + (lambda_*np.sum(w**2))/(2*m)   #to prevent overfitting

        dw = (1/m) * np.dot(X.T, (f_wb - Y)) + (lambda_*w)/m
        db = (1/m) * np.sum(f_wb - Y)

        # Update weights and bias
        w = w - alpha * dw
        b = b - alpha * db

        cost_history.append(cost)

        if i % (iterations / 10) == 0:
            print(f"The cost after {i} iteration is {cost_history[-1]}")

    return w, b, cost_history


In [37]:

for year, group in df.groupby('Year'):
    X = group[['Runs','StrRate']]
    y = group[['Final Price']]
    X_train, X_test, y_train, y_test= train_test_split(x, Y, test_size=0.3, random_state=1234)
    y_train=y_train.reshape(X_train.shape[0],1)
    y_test=y_test.reshape(X_test.shape[0],1)
    w,b,c = linmodel(X_train,y_train,0.00001,10000,1)
    y_pred=np.dot(X_test,w)+b
    y_mean=y_pred.mean()
    r2_value=1-(np.sum((y_test-y_pred)**2)/np.sum((y_test-y_mean)**2))
    print("")
    r.append(print(f"Year: {year},  Coefficients: {w.T}, Intercept: {b}"))
    print("")
    print(f"the R squared value is {r2_value}")
    print(" ")


The cost after 0 iteration is 760620018831999.5
The cost after 1000 iteration is 363646623546349.6
The cost after 2000 iteration is 363533507225901.1
The cost after 3000 iteration is 363421007501360.3
The cost after 4000 iteration is 363309121011669.25
The cost after 5000 iteration is 363197844414091.0
The cost after 6000 iteration is 363087174384109.9
The cost after 7000 iteration is 362977107615332.1
The cost after 8000 iteration is 362867640819387.06
The cost after 9000 iteration is 362758770725828.8

Year: 2008,  Coefficients: [[ 78176.22513439 125117.45123962]], Intercept: 334154.32776374364

the R squared value is 0.17311449050457028
 
The cost after 0 iteration is 760617768927001.2
The cost after 1000 iteration is 363646623559335.1
The cost after 2000 iteration is 363533507238815.8
The cost after 3000 iteration is 363421007514204.6
The cost after 4000 iteration is 363309121024443.56
The cost after 5000 iteration is 363197844426795.7
The cost after 6000 iteration is 3630871743967

# Logistic Regression

Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on.

In this week you will be doing logistic regression on breast cancer dataset using sklearn library. Feel free to create any new functions required.

In [38]:
#importinf libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Prepare Data

In [39]:
breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

In [40]:
#spliting data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Forward pass

In [None]:
def forward_log(x):
  pass

Binary cross entropy loss

In [None]:
def BCELoss(y,y_pred):
  pass

Implement Logistic Regression here :)

In [41]:
y_train=y_train.reshape(X_train.shape[0],1)
y_test=y_test.reshape(X_test.shape[0],1)

In [42]:
print("Shape of X_train : ", X_train.shape)
print("Shape of y_train : ", y_train.shape)
print("Shape of X_test : ", X_test.shape)
print("Shape of y_test : ", y_test.shape)

Shape of X_train :  (455, 30)
Shape of y_train :  (455, 1)
Shape of X_test :  (114, 30)
Shape of y_test :  (114, 1)


In [43]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))

In [44]:
def logmodel(X,Y,alpha, iterations):
  m=Y.shape[0]
  n=X.shape[1]
  w=np.zeros((n,1))
  b=0
  cost_history=[]

  for i in range(iterations):
    Z=np.dot(X,w)+b
    f_wb=sigmoid(Z)

    #cost function
    cost = -((np.sum(Y * np.log(f_wb) + (1 - Y) * np.log(1 - f_wb))) / (2*m))


    dw=(1/m)*np.dot(((f_wb-Y).T),X)
    db=(1/m)*np.sum(f_wb-Y)

    w=w-(alpha*dw.T)
    b=b-alpha*db

    cost_history.append(cost)
    if i%(iterations/10)==0:
      print(f"the cost after {i} iteration is {cost_history[-1]}")

  return w,b,f_wb




In [45]:
w,b,f=logmodel(X_train,y_train,0.1,10000)

the cost after 0 iteration is 0.34657359027997264
the cost after 1000 iteration is 0.020427894488219012
the cost after 2000 iteration is 0.01666296182398825
the cost after 3000 iteration is 0.014922674653109548
the cost after 4000 iteration is 0.013860657186592622
the cost after 5000 iteration is 0.013116773847611804
the cost after 6000 iteration is 0.012549388656593536
the cost after 7000 iteration is 0.012091370528804666
the cost after 8000 iteration is 0.011706651535829
the cost after 9000 iteration is 0.011373988550930829


In [46]:
#so the logistic model fitted is y=sigmoid(X*w+b)

In [47]:
y_pred=sigmoid(np.dot(X_test,w)+b)
for i in range(y_pred.shape[0]):
  if y_pred[i]<=0.4:                 #threshold 0.4
     y_pred[i]=0
  else:
     y_pred[i]=1

In [48]:
pr=[]
for i in range(y_pred.shape[0]):
  if y_pred[i]==y_test[i]:
    pr.append(1)
  else:
    pr.append(0)
accuracy_score=np.sum(pr)/y_pred.shape[0]
print(f"the accuracy score is {accuracy_score}")


the accuracy score is 0.9649122807017544
