# Machine Learning

**Machine Learning (ML)** is a branch of *artificial intelligence* that aims at building systems that can learn from data, identify patterns and make decisions with minimal human intervention.

# Supervised Machine Learning

**Supervised Learning** is a type of machine learning where an algorithm is trained on a **labeled dataset** consisting of input-output pairs. The algorithm learns the mapping between the input data and the corresponding output labels, allowing it to make predictions or decisions when given new, unseen input.

## 1. Linear Regression

The ultimate goal of linear regression is to **find a line that best fits the data**.

### i. Simple Linear Regression

If a single independent variable is used to predict the value of a numerical dependent variable then such a linear regression algorithm is called **Simple Linear Regression.**
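For a single feature, the best-fit line has a closed-form solution under ordinary least squares. The sketch below is only an illustration, not the way scikit-learn is used in the cells that follow: it computes the slope and intercept by hand for the same Age/Premium values. The helper name `fit_line` is an assumption introduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

ages = [25, 30, 35, 40, 45]
premiums = [18000, 32000, 40000, 47000, 55000]
slope, intercept = fit_line(ages, premiums)
print(slope, intercept)
```

For these five points the slope and intercept come out as 1780 and -23900, the same numbers used in the `y=mx+c` validation later in this section.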

In [505]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

In [506]:
import warnings
warnings.simplefilter(action="ignore",category=FutureWarning)
#code to ignore the future warnings

In [507]:
import warnings
warnings.filterwarnings("ignore")   #To avoid the warnings

**Scikit-learn** is one of the most widely used machine learning packages for Python and has many algorithms built in.

In [508]:
info={"Age":[25,30,35,40,45],"Premium":[18000,32000,40000,47000,55000]}

In [509]:
info

{'Age': [25, 30, 35, 40, 45], 'Premium': [18000, 32000, 40000, 47000, 55000]}

In [None]:
df=pd.DataFrame(info)

In [None]:
df

In [None]:
sns.pairplot(df)
plt.show()

In [None]:
X = np.array(df["Age"]).reshape(-1, 1)
y = np.array(df["Premium"])

In [None]:
from sklearn import linear_model
reg=linear_model.LinearRegression()

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)

In [None]:
predictions = model.predict(X)

In [None]:
plt.scatter(X, y, label="Actual Data")
plt.plot(X, predictions, color='green', label="Linear Regression")
plt.xlabel("Age")
plt.ylabel("Premium")
plt.title("Linear Regression Model")
plt.legend()
plt.show()

In [None]:
reg.fit(df[["Age"]],df["Premium"])
#here, fit means we train the regression model using the data points we have
#"Age" is the independent variable
#"Premium" is the dependent variable

### Predicting the unseen data in dataset

### Finding the premium for the age 21 (new data/unseen data)

In [None]:
reg.predict([[21]])  #finding the premium for the age 21

### Finding the premium for the age 50 (new data/unseen data)

In [None]:
reg.predict([[50]]) #finding the premium for the age 50

In [None]:
reg.coef_   #slope

In [None]:
reg.intercept_   #intercept: where the line crosses the y-axis

### Validating our Model

### y=mx+c

In [None]:
1780*21+(-23900)   #predicting the premium amount for age 21

In [None]:
1780*50+(-23900)  #predicting the premium amount for age 50

### ii. Multiple Linear Regression

If more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called **Multiple Linear Regression.**
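With several features, the same least-squares idea can be solved as a small linear algebra problem. The sketch below uses NumPy's `np.linalg.lstsq` on hypothetical toy data generated from a known equation, so the recovered coefficients can be checked; the data and variable names are assumptions, not the insurance data used in this notebook.

```python
import numpy as np

# Hypothetical toy data generated from y = 2*x1 + 3*x2 + 5 (no noise)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

# Append a column of ones so the intercept is learned as one more coefficient
X_design = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of X_design @ coeffs = y
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
m1, m2, c = coeffs
print(m1, m2, c)

# Prediction for a new row follows y = m1*x1 + m2*x2 + c
new_row = np.array([6.0, 2.0])
print(m1 * new_row[0] + m2 * new_row[1] + c)
```

Because the toy data has no noise, the fit should recover coefficients very close to 2, 3 and 5.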

In [None]:
data={"Age":[25,30,35,40,45],"Height":[162.56,172.72,167.64,165.10,157.48],"Weight":[70,95,78,110,85],"Premium":[18000,38000,38000,60000,70000]}

In [None]:
df=pd.DataFrame(data)

In [None]:
df

In [None]:
df.isna().sum() #to check the null values in the data

In [None]:
sns.pairplot(df,hue="Premium")
plt.show()

In [None]:
reg=linear_model.LinearRegression()

In [None]:
reg.fit(df[["Age","Height","Weight"]],df["Premium"])

In [None]:
reg.coef_

In [None]:
reg.intercept_

### Finding the Premium for Age 27, Height 167.56, Weight 60

In [None]:
reg.predict([[27,167.56,60]])

### Validating the model

### y=m1x1+m2x2+m3x3+c

In [None]:
2150.26052416*27+-248.45851574*167.56+312.65291961*60+-16827.013154824934

### Finding the Premium for Age 60, Height 165.10, Weight 80

In [None]:
reg.predict([[60,165.10,80]])

### Validating the model

### y=m1x1+m2x2+m3x3+c

In [None]:
2150.26052416*60+-248.45851574*165.10+312.65291961*80+-16827.013154824934

## 2. Logistic Regression

**Logistic Regression** is a supervised machine learning algorithm used to model the relationship between one binary dependent variable and one or more independent variables. The logistic equation outputs a probability value, which always lies between 0 and 1 and can be mapped to classes. It can be used for:

1. Binary Classification
2. Multiclass Classification
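To make the "probability between 0 and 1" idea concrete, here is a minimal sketch of the logistic (sigmoid) function and a threshold rule for binary classification. The function names and the 0.5 threshold are assumptions chosen for illustration; the plain `math.exp` version is fine for moderate inputs but can overflow for very large negative values.

```python
import math

def sigmoid(z):
    """Logistic function: squashes a real-valued score into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Map the probability to class 1 when it reaches the threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0

# z stands for the linear score m*x + c computed from the features
for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

A score of 0 maps to a probability of exactly 0.5; negative scores fall below it and positive scores above it.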

### 1. Binary Classification

In [None]:
df=pd.read_csv("insurance_data.csv")

In [None]:
df.head()

In [None]:
sns.pairplot(df)
plt.show()

In [None]:
#df["bought_insurance"] = df["bought_insurance"].replace({"no": 0, "yes": 1})   (converts the categorical labels into binary format, only needed if the column holds "no"/"yes")

In [None]:
e = np.array(df["age"]).reshape(-1, 1)
f = np.array(df["bought_insurance"])

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train,x_test,y_train,y_test=train_test_split(e,f,test_size=0.2,random_state=0)

In [None]:
len(x_train)

In [None]:
len(x_test)

In [None]:
x_test

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr=LogisticRegression()

In [None]:
lr.fit(x_train,y_train)

In [None]:
lr.predict(x_test)

In [None]:
y_predictionss=lr.predict(x_test)

In [None]:
lr.predict([[23]]) #0 = "no", 1 = "yes"; here the model predicts that a 23-year-old will not buy the insurance

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cmm = confusion_matrix(y_test, y_predictionss)
print ("Confusion Matrix:\n",cmm)
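For binary labels 0 and 1, `confusion_matrix` puts the true classes in rows and the predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below shows how the usual metrics follow from those four counts; the counts are hypothetical and the helper name `summarize` is an assumption.

```python
def summarize(cm):
    """Derive accuracy, precision and recall from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total                      # share of all predictions that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of real positives, how many were found
    return accuracy, precision, recall

cm_example = [[3, 1], [1, 5]]  # hypothetical counts, not the notebook's output
accuracy, precision, recall = summarize(cm_example)
print(accuracy, precision, recall)
```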

### 2. Multiclass Classification

In [None]:
df=pd.read_csv(r"C:\Users\ganig\OneDrive\Desktop\Dataset\iris_dataset.csv")

In [None]:
df

In [None]:
df.rename(columns={'target': 'species'}, inplace=True)

In [None]:
df["species"].unique()

In [None]:
df["species"] = df["species"].replace({'Iris-setosa':"1", 'Iris-versicolor':"2", 'Iris-virginica':"3"})   #assign the result back instead of using inplace on a single column

In [None]:
df

In [None]:
sns.pairplot(df,hue="species")
plt.show()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test=train_test_split(df[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]],df["species"],test_size=0.2)

In [None]:
len(X_train)

In [None]:
len(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr=LogisticRegression()

In [None]:
lr.fit(X_train,y_train)

In [None]:
lr.predict(X_test)

### Predict the "species name" based on sepal length,sepal width,petal length,petal width in cm

In [None]:
lr.predict([[5.1,3.8,1.9,0.4]]) #belongs to Iris-setosa

In [None]:
X_test   #'Iris-setosa':"1", 'Iris-versicolor':"2", 'Iris-virginica':"3"

## Score

In [None]:
print("Score:",lr.score(X_test,y_test))

## Accuracy

In [None]:
y_pred=lr.predict(X_test)

In [None]:
y_pred

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

## 3. Support Vector Machine

**Support Vector Machine (SVM)** is a supervised machine learning algorithm that can be used for both regression and classification, although it is most commonly used for classification. SVM finds the **hyperplane** between classes of data that **maximizes the margin between classes**.
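For a linear SVM the hyperplane can be written as `w.x + b = 0`. Assuming the usual scaling in which the closest points of each class satisfy `|w.x + b| = 1`, the margin width is `2 / ||w||`, so maximizing the margin means keeping `||w||` small. The sketch below only illustrates that geometry with a hypothetical `w` and `b`; it does not train anything, and the function names are assumptions.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = (3.0, 4.0), -5.0   # hypothetical hyperplane, ||w|| = 5

print(signed_distance(w, b, (3.0, 4.0)))  # positive side of the hyperplane
print(signed_distance(w, b, (0.0, 0.0)))  # negative side of the hyperplane
print(margin_width(w))
```

The sign of the distance tells which side of the hyperplane a point is on, which is how a linear classifier assigns the class.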

In [None]:
import pandas as pd

In [None]:
data=pd.read_csv(r"C:\Users\ganig\OneDrive\Desktop\Dataset\iris_dataset.csv")

In [None]:
data.head()

In [None]:
sns.pairplot(data,hue="target")
plt.show()

From the above data, **"target"** is the *dependent variable/target variable*, and **"sepal length (cm)"**, **"sepal width (cm)"**, **"petal length (cm)"** and **"petal width (cm)"** are the *independent variables*.

In [None]:
c=data.iloc[:,:4]    #feature matrix

In [None]:
d=data.iloc[:,-1]    #dependent variable vector

In [None]:
c

In [None]:
d

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X1_train, X2_test, y1_train, y2_test = train_test_split(c,d,test_size=0.2,random_state=0)

In [None]:
len(X2_test)

In [None]:
len(X1_train)

In [None]:
len(y2_test)

In [None]:
len(y1_train)

In [None]:
from sklearn.svm import SVC #support vector classification

In [None]:
model=SVC(kernel="linear")

In [None]:
model.fit(X1_train,y1_train)

In [None]:
model.predict(X2_test)

In [None]:
X2_test.head()

## Score

In [None]:
model.score(X2_test,y2_test)

## Accuracy

In [None]:
y_predictions=model.predict(X2_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y2_test, y_predictions)

## 4. K-NN (K-Nearest Neighbors)

**KNN (K-Nearest Neighbors)** is one of the simplest *supervised machine learning* algorithms and is mostly used for **classification**. It classifies a data point based on how its neighbors are classified, so it relies on *feature similarity*. Choosing the right value of *k* is a form of hyperparameter tuning (often called **parameter tuning**), and it is important for better accuracy.

**To choose a value of k:**
A common rule of thumb is the square root of n (the total number of data points). An *odd value* of k is usually selected to avoid tied votes between two classes of data.
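The rule of thumb above can be sketched as a small helper. This is only a starting point for tuning, not a guarantee of the best k; the function name `suggest_k` and the choice to round down to the nearest odd number are assumptions made for this example.

```python
import math

def suggest_k(n_samples):
    """Starting value for k: integer square root of n, lowered by 1 if even."""
    k = max(1, math.isqrt(n_samples))
    if k % 2 == 0:
        k -= 1          # prefer an odd k so a two-class vote cannot tie
    return max(1, k)

for n in (25, 100, 154):   # hypothetical sample counts
    print(n, suggest_k(n))
```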

**To choose a neighbour:**
To find the nearest neighbors, we will calculate **Euclidean distance**.
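The two steps (measure Euclidean distance, then let the k closest points vote) can be sketched in plain Python. The toy points, labels and the function name `knn_predict` are assumptions for illustration; scikit-learn's `KNeighborsClassifier`, used later in this section, does the same job far more efficiently.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Euclidean distance from the query to every training point
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest neighbors and take a majority vote over their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points forming two groups
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```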

In [None]:
DataFrame=pd.read_csv(r"C:\Users\ganig\OneDrive\Desktop\Dataset\diabetes.csv")

In [None]:
DataFrame.head()

**Objective:** Predict whether a person will be diagnosed with diabetes or not.

In [None]:
import pandas as pd
import numpy as np

In [None]:
len(DataFrame)

In [None]:
DataFrame.count()

In [None]:
DataFrame.isna().sum()   #To find the null values in the DataFrame

In [None]:
# Replace zeroes
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

In [None]:
for column in zero_not_accepted:
    # Treat zeros as missing values, then fill them with the column mean
    DataFrame[column] = DataFrame[column].replace(0, np.nan)
    mean = int(DataFrame[column].mean(skipna=True))
    DataFrame[column] = DataFrame[column].replace(np.nan, mean)

In [None]:
sns.pairplot(DataFrame,hue="Outcome")
plt.show()

In [None]:
#Data Split
ind = DataFrame.iloc[:, 0:8]  #independent
dep = DataFrame.iloc[:, 8]    #dependent Variable
from sklearn.model_selection import train_test_split
a1_train, a2_test, b1_train, b2_test = train_test_split(ind, dep, random_state=0, test_size=0.2) #20% test data

In [None]:
a2_test.count()   #test_data

In [None]:
a1_train.count()   #train_data

In [None]:
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc_a = StandardScaler()
a1_train = sc_a.fit_transform(a1_train)
a2_test = sc_a.transform(a2_test)

In [None]:
import math
#rule of thumb for the number of neighbors "k": square root of the sample count
math.sqrt(len(b2_test))
#Note: pick an odd value of k close to this result

In [None]:
# Define the model: K-NN
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=11, p=2,metric='euclidean')
#n_neighbors=11: This specifies the number of neighbors to consider when making predictions

# Training the Model
classifier.fit(a1_train, b1_train)

In [None]:
y_predict = classifier.predict(a2_test)
print(y_predict)

In [None]:
# Evaluate Model
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(b2_test, y_predict)
print ("Confusion Matrix:\n",cm)    
#\n is for New Line
#sns.heatmap(cm,annot=True)

In [None]:
from sklearn.metrics import f1_score
print("F1 Score:",f1_score(b2_test, y_predict))

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy:",accuracy_score(b2_test, y_predict))

In [None]:
from sklearn.metrics import accuracy_score,mean_squared_error
#with 0/1 labels, the mean squared error equals the misclassification rate (1 - accuracy)
print("Mean Squared Error:",mean_squared_error(b2_test,y_predict))

In [None]:
from sklearn.metrics import classification_report
print(classification_report(b2_test, y_predict))

# Unsupervised Machine Learning

**Unsupervised Machine Learning** involves *extracting patterns or relationships from data without labeled outcomes* or explicit guidance, allowing the algorithm to identify inherent structures autonomously.

**Clustering** is a machine learning technique that involves grouping similar data points into clusters or subgroups *based on the similarity of their features*. The goal of clustering is to identify natural patterns or structures within the data, without any prior knowledge of the underlying categories or labels. 

## 5. K-Means Clustering

**K-Means** is a clustering algorithm used in machine learning to group data points into k distinct clusters based on their similarities, with the goal of **minimizing the variance** (the spread or dispersion of a set of values) within each cluster.
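A common way to pursue that goal is Lloyd's algorithm, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The NumPy sketch below is a simplified illustration under stated assumptions (random initial centroids picked from the data, a fixed number of iterations, hypothetical toy points); it is not scikit-learn's `KMeans` implementation.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each
pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = kmeans(pts, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 or cluster 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.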

In [None]:
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
dt=pd.read_csv(r"C:\Users\ganig\OneDrive\Desktop\Dataset\income.csv")

In [None]:
dt.head()

In [None]:
plt.scatter(x="Age",y="Income($)",data=dt)
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()

In [None]:
from sklearn.cluster import KMeans
km=KMeans(n_clusters=3)
km

In [None]:
y_prediction=km.fit_predict(dt[["Age","Income($)"]])
y_prediction

In [None]:
dt["Cluster"]=y_prediction  #Adding a cluster column to the dataframe to represent it visually

In [None]:
dt.head()

In [None]:
sns.pairplot(dt,hue="Cluster")
plt.show()

In [None]:
km.cluster_centers_

In [None]:
df1 = dt[dt.Cluster==0]
df2 = dt[dt.Cluster==1]
df3 = dt[dt.Cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
plt.show()

In [None]:
scaler = MinMaxScaler()
scaler.fit(dt[['Income($)']])
dt['Income($)'] = scaler.transform(dt[['Income($)']])
scaler.fit(dt[['Age']])
dt['Age'] = scaler.transform(dt[['Age']])

In [None]:
dt.head()

In [None]:
km=KMeans(n_clusters=3)
km

In [None]:
y_prediction=km.fit_predict(dt[["Age","Income($)"]])
y_prediction

In [None]:
dt["Cluster"]=y_prediction

In [None]:
dt.head()

In [None]:
km.cluster_centers_

In [None]:
df1 = dt[dt.Cluster==0]
df2 = dt[dt.Cluster==1]
df3 = dt[dt.Cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.title('Clusters of Customers')
plt.legend()
plt.show()

### Elbow Plot

In [None]:
sse = []
k_rng = range(1,10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(dt[['Age','Income($)']])
    sse.append(km.inertia_)

In [None]:
print("Sum of Squared Errors:\n",sse)
#Note: As K value increases sse(sum of squared errors) decreases

In [None]:
k_rng

The **"elbow"** of the curve is often a good indicator of the optimal number of clusters. Choose the value of k at the point where the SSE starts decreasing at a slower rate, forming an elbow-like shape in the plot.
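To see why an elbow appears, it helps to compute the SSE by hand on a tiny example. The sketch below uses hypothetical 1-D data with two obvious groups and hand-picked centroids (no fitting is done); the function name `sse` is an assumption.

```python
def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

points = [1, 2, 3, 11, 12, 13]     # hypothetical 1-D data, two natural groups

sse_k1 = sse(points, [7])          # k = 1: one centroid at the overall mean
sse_k2 = sse(points, [2, 12])      # k = 2: one centroid per group
sse_k3 = sse(points, [1.5, 3, 12]) # k = 3: one group split further
print(sse_k1, sse_k2, sse_k3)
```

Going from one centroid to two removes almost all of the error, while a third centroid barely helps, so in this toy case the elbow sits at k = 2.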

In [None]:
plt.plot(k_rng, sse, marker='s',color="red")
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.title('Elbow Method for Optimal k')
plt.show()

**Note:** The elbow method shows that **3** is a good value for **K**

# Principal Component Analysis (PCA) / Dimensionality Reduction

**PCA** is a process of finding the **principal components** of a dataset: new axes, built as combinations of the original features, along which the data varies the most. PCA is unsupervised, so it does not look at the target variable/dependent variable. It is called a **dimensionality reduction technique** because keeping only the first few components reduces the number of dimensions, which helps to avoid the **curse of dimensionality**.
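One way to see what PCA computes is to do it by hand with a singular value decomposition. The sketch below is an illustration on hypothetical random data in which the third feature is almost a copy of the first; the variable names are assumptions, and scikit-learn's `PCA` class wraps these steps with more options.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features, feature 2 is nearly a copy of feature 0
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.05 * rng.normal(size=200)])

# 1. Center the data, since PCA measures variance around the mean
X_centered = X - X.mean(axis=0)
# 2. SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# 3. Fraction of the total variance explained by each component
explained_ratio = S**2 / np.sum(S**2)
# 4. Project onto the first two components to go from 3 features to 2
X_reduced = X_centered @ Vt[:2].T

print(explained_ratio)
print(X_reduced.shape)
```

Because one feature is nearly redundant, the first two components should carry almost all of the variance, so dropping the third loses very little information.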

In [None]:
import pandas as pd
from sklearn.datasets import load_digits

In [None]:
Px=load_digits()
Px.keys()

In [None]:
Px.feature_names

In [None]:
Px.data.shape    #total we have 64 features in this data

In [None]:
Px.data[0].reshape(8,8)

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

plt.gray()
plt.matshow(Px.data[0].reshape(8,8))
plt.show()

In [None]:
plt.gray()
plt.matshow(Px.data[1].reshape(8,8))
plt.show()

In [None]:
Px.target

In [None]:
Px.target[1]

In [None]:
Pixel=pd.DataFrame(Px.data ,columns =Px.feature_names)
Pixel

In [None]:
Pixel.describe()

In [None]:
XPixel=Pixel
y=Px.target

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_scaled=scaler.fit_transform(XPixel)

In [None]:
X_scaled

In [None]:
from sklearn.model_selection import train_test_split
PX_train, PX_test, py_train, py_test = train_test_split(X_scaled, y, test_size=0.30, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression

PXmodel=LogisticRegression()
PXmodel.fit(PX_train,py_train)
PXmodel.score(PX_test,py_test)

### Now, let's use PCA and find the accuracy of the model

In [None]:
Pixel.head()

In [None]:
from sklearn.decomposition import PCA

pca=PCA(0.95)  #to capture 95% of feature variance
x_pca=pca.fit_transform(Pixel)
x_pca.shape

**Note:** Here, by using PCA we reduced the number of features from **64 to 29**.
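When a fraction such as 0.95 is requested, the number of components is driven by the cumulative explained variance. The sketch below shows that idea with hypothetical variance ratios; the helper `components_needed` is an assumption and only approximates the selection rule, it is not scikit-learn's code.

```python
from itertools import accumulate

def components_needed(variance_ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    for count, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold:
            return count
    return len(variance_ratios)

ratios = [0.5, 0.3, 0.15, 0.04, 0.01]   # hypothetical, sorted from largest to smallest
print(components_needed(ratios, 0.75))
print(components_needed(ratios, 0.90))
```

Raising the threshold keeps more components, while lowering it compresses the data further at the cost of discarding more variance.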

In [None]:
x_pca

In [None]:
pca.explained_variance_ratio_   #fraction of the total variance explained by each component

In [None]:
pca.n_components_   #to print the no. of components we got using PCA

In [None]:
from sklearn.model_selection import train_test_split
XPCA_train, XPCA_test, py_train, py_test = train_test_split(x_pca, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression

PXmodel=LogisticRegression(max_iter=1000)
PXmodel.fit(XPCA_train,py_train)
PXmodel.score(XPCA_test,py_test)

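By using PCA we can reduce the computational budget: fewer features mean less data to store and train on, usually at the cost of some accuracy. The cells below repeat the experiment with only 2 components.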

In [None]:
pca=PCA(n_components=2)
x_pca=pca.fit_transform(Pixel)
x_pca.shape

In [None]:
x_pca

In [None]:
pca.explained_variance_ratio_

In [None]:
pca.n_components_

In [None]:
from sklearn.model_selection import train_test_split
XPCA_train, XPCA_test, py_train, py_test = train_test_split(x_pca, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression

PXmodel=LogisticRegression(max_iter=1000)
PXmodel.fit(XPCA_train,py_train)
PXmodel.score(XPCA_test,py_test)