**Project Description: Glass Identification**


The dataset describes the chemical properties of glass. The task is to classify each glass sample into one of six classes based on its chemical properties. The dataset is credited to Vina Spiehler (1987). The study of glass classification was motivated by criminological investigation: glass left at a crime scene can be used as evidence, but only if it is correctly identified.


Chemical compositions are measured as weight percent of the corresponding oxide.

Attribute Information:
1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit of measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass: (class attribute)
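
Because attributes 3 to 10 are weight percentages of oxides, one would expect the eight oxide values of a sample to add up to roughly 100. The sketch below shows that sanity check. The sample values, the helper names `oxide_total` and `looks_complete`, and the tolerance of 1.5 percentage points are illustrative assumptions, not taken from the dataset or the notebook.

```python
# Oxide columns from the attribute list (attributes 3-10).
OXIDE_COLUMNS = ["Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]


def oxide_total(sample):
    """Sum the oxide weight percentages of one sample (a dict keyed by column name)."""
    return sum(sample[col] for col in OXIDE_COLUMNS)


def looks_complete(sample, tolerance=1.5):
    """True when the oxide percentages add up to 100 within `tolerance` points."""
    return abs(oxide_total(sample) - 100.0) <= tolerance


# Hypothetical sample, made up for illustration (not a row from the dataset).
sample = {"RI": 1.52, "Na": 13.5, "Mg": 3.5, "Al": 1.5, "Si": 72.5,
          "K": 0.5, "Ca": 8.5, "Ba": 0.0, "Fe": 0.0}

print("oxide total:", oxide_total(sample))
print("plausible composition:", looks_complete(sample))
```

Rows that fail such a check would be candidates for a closer look before modelling.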


column_name_update =  ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type of glass']

•	1- building_windows_float_processed
•	2- building_windows_non_float_processed
•	3- vehicle_windows_float_processed
•	4- vehicle_windows_non_float_processed (none in this database)
•	5- containers
•	6- tableware
•	7- headlamps

There are 214 observations in the dataset. The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7).
Prediction target: Type of glass
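
The class coding and the window / non-window split described above can be captured in a small lookup. This is a sketch only: the names `GLASS_TYPES` and `is_window_glass` are hypothetical helpers introduced for illustration and are not part of the notebook.

```python
# Class codes as listed above; code 4 has no samples in this database.
GLASS_TYPES = {
    1: "building_windows_float_processed",
    2: "building_windows_non_float_processed",
    3: "vehicle_windows_float_processed",
    4: "vehicle_windows_non_float_processed",
    5: "containers",
    6: "tableware",
    7: "headlamps",
}


def is_window_glass(class_code):
    """Classes 1-4 are window glass; classes 5-7 are non-window glass."""
    if class_code not in GLASS_TYPES:
        raise ValueError(f"unknown glass class: {class_code}")
    return class_code <= 4


for code in (1, 3, 6):
    kind = "window" if is_window_glass(code) else "non-window"
    print(code, GLASS_TYPES[code], kind)
```

Once the data is loaded into a pandas DataFrame, the same dictionary could be passed to `Series.map` on the `Type of glass` column to obtain readable labels.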



Dataset Link-
•	https://github.com/FlipRoboTechnologies/ML-Datasets/blob/main/Glass%20Identification/Glass%20Identification.csv


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

In [None]:
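# Raw CSV on GitHub. Column names are supplied explicitly, so every row is read as data.
url = "https://raw.githubusercontent.com/FlipRoboTechnologies/ML-Datasets/main/Glass%20Identification/Glass%20Identification.csv"
column_name_update =  ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type of glass']
df = pd.read_csv(url, sep = ",", header = None, names = column_name_update)
# The Id column is only a row counter, so use it as the index rather than as a feature
df.set_index('Id',inplace = True)
df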

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns.tolist()

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

The data above contains no missing values.

In [None]:
sns.heatmap(df.isnull())

In [None]:
missing_values = df.isnull().sum()
missing_values

In [None]:
missing_values_data = df.isnull().sum().sort_values(ascending = False)
missing_values_data

In [None]:
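# Impute missing values with the column mean (a no-op here, since no values are missing)
df.fillna(df.mean(),inplace = True)

In [None]:
# Alternative imputations by median and mode; these change nothing once the mean fill has run
df.fillna(df.median(),inplace = True)
df.fillna(df.mode().iloc[0],inplace = True)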

In [None]:
sns.heatmap(df.isnull())

In [None]:
df.head()

In [None]:
corr = df.corr()
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr,annot=True,ax=ax)

In [None]:
# Same correlation heatmap on a wider figure; seaborn labels the axes from corr's index and columns
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(corr,annot=True,ax=ax)
plt.show()

In [None]:
ax1 = sns.boxplot(x='Type of glass',y='RI',data=df)
print(df.groupby('Type of glass')['RI'].mean())

In [None]:
df.describe()

In [None]:
# Distribution of every column (the nine features plus the class label)
plt.figure(figsize = (10, 12), facecolor = "white")
for plotnumber, col in enumerate(df.columns, start = 1):
    ax = plt.subplot(5, 2, plotnumber)
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(df[col], kde = True, color = "m")
    plt.xlabel(col, fontsize = 12)
    plt.yticks(rotation = 0,fontsize = 10)
plt.tight_layout()

In [None]:
df.skew()

In [None]:
# Skewness of the feature columns only (class label excluded)
columns = df.columns.drop("Type of glass").tolist()
skew_df = df[columns].skew().to_frame(name="skewness")
skew_df

In [None]:
k_skewness = df['K'].skew()
Ca_skewness = df['Ca'].skew()
Ba_skewness = df['Ba'].skew()

print(k_skewness)
print(Ca_skewness)
print(Ba_skewness)

To handle skewness, especially in highly skewed columns, we can apply one of the following transformations:

Log Transformation: Takes the logarithm of each value to reduce right skewness. It requires strictly positive values.

The **Yeo-Johnson transformation** is a power transformation used to stabilize variance, make data more normally distributed, and improve the validity of measures of association. It handles positive, zero, and negative values, which makes it more flexible than transformations such as Box-Cox, which only handles strictly positive values.

Square Root Transformation: Takes the square root of each value to reduce moderate right skewness. It requires non-negative values.

Box-Cox Transformation: A power transformation with a tunable parameter lambda that can reduce both positive and negative skewness, but only for strictly positive data.
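
To make the Yeo-Johnson definition concrete, here is a minimal NumPy sketch of the transform for a fixed lambda, together with a plain moment-based skewness estimate. It is an illustration under stated assumptions: `yeo_johnson` and `sample_skewness` are hypothetical helper names, the data is synthetic exponential noise rather than the glass data, and lambda is set by hand here, whereas `scipy.stats.yeojohnson` estimates it when no lambda is given.

```python
import numpy as np


def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform for a fixed lambda (accepts zero and negative values)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Branch for non-negative values
    if lmbda != 0:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    # Branch for negative values
    if lmbda != 2:
        out[~pos] = -(((1.0 - x[~pos]) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


def sample_skewness(x):
    """Biased moment estimator of skewness: m3 / m2**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic right-skewed data (assumption: stands in for a skewed oxide column).
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=2000)

print("skewness before:", sample_skewness(skewed))
print("skewness after (lambda=0):", sample_skewness(yeo_johnson(skewed, 0.0)))
```

With lambda equal to 1 the transform leaves the data unchanged, and with lambda equal to 0 it reduces to log(1 + x) for non-negative inputs, which is why it stays finite for columns that contain zeros.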

In [None]:
df_transformed = df.copy()

In [None]:
from scipy.stats import yeojohnson
import scipy.special as special
from scipy import stats

columns = df.columns.tolist()
columns.remove("Type of glass")

In [None]:
skewness_transformation = {}

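for col in columns:
    # Zero entries would give -inf for log(x) and inf for 1/x, so both are applied to (1 + x)
    transformed_log = np.log1p(df[col])  # Log Transformation, log(1 + x)
    transformed_boxcox = special.boxcox1p(df[col], 0.15)     # Box-Cox Transformation of (1 + x) with fixed lambda=0.15
    transformed_inverse = 1 / (1 + df[col])   # Inverse Transformation
    transformed_cbrt = np.cbrt(df[col]) # Cube Root Transformation
    transformed_yeojohnson, _ = stats.yeojohnson(df[col])  # Yeo-Johnson with lambda estimated from the data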
    transformation_skewness = {
        "Log Transformation": stats.skew(transformed_log),
        "Box-Cox Transformation": stats.skew(transformed_boxcox),
        "Inverse Transformation": stats.skew(transformed_inverse),
        "Cube Root Transformation": stats.skew(transformed_cbrt),
        "Yeo Johnson Transformation": stats.skew(transformed_yeojohnson)}

    skewness_transformation[col] = transformation_skewness

In [None]:
df_ref = pd.DataFrame.from_dict(skewness_transformation, orient= 'index')
df_ref = pd.concat([skew_df["skewness"],  df_ref], axis=1)
df_ref

In [None]:
# Apply Yeo-Johnson to every feature column, with lambda estimated per column on the full dataset
for col in columns:
  transformed_col,_ = stats.yeojohnson(df[col])
  df[col]= transformed_col

In [None]:
df.sample(10)

In [None]:
skewed_features = df.skew().sort_values(ascending=False)
print(skewed_features)

In [None]:
df["Type of glass"].unique()

In [None]:
X = df.drop("Type of glass",axis = 1).values
y = df["Type of glass"].values  # 1-D label array, the shape scikit-learn estimators expect

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25,random_state = 42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

**Classification Models**

In [None]:
# Logistic regression: imports and a scaling + classifier pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

scaler = StandardScaler()
lr = LogisticRegression()

# Pipeline that standardizes the features before fitting logistic regression
pipeline = Pipeline([('scaler', scaler), ('lr', lr)])
# Standardized copy of the features, shown for inspection only
x = pd.DataFrame(scaler.fit_transform(X),columns = df.drop("Type of glass", axis =1).columns)
x

In [None]:
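from sklearn.model_selection  import train_test_split
from sklearn.ensemble  import RandomForestClassifier
from sklearn.metrics  import accuracy_score

# Search for the split seed that gives the highest test accuracy.
# Caution: choosing the seed by test accuracy makes the reported score optimistic.
maxAccu = 0
maxRS = 0
for i in range(1, 200):
  X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25,random_state = i)
  RFR = RandomForestClassifier()
  RFR.fit(X_train, y_train)
  pred = RFR.predict(X_test)
  acc = accuracy_score(y_test, pred)
  if acc>maxAccu:
    maxAccu = acc
    maxRS = i
print("best accuracy is ", maxAccu, "at random_state", maxRS)

# Re-create the split for the selected seed; otherwise later cells would use the last seed tried (199)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25,random_state = maxRS)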

In [None]:
from  sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier ,AdaBoostClassifier,BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import cross_val_score


In [None]:


# Checking accuracy for RandomForestClassifier
RFC = RandomForestClassifier()
RFC.fit(X_train,y_train)
predRFC = RFC.predict(X_test)
print(accuracy_score(y_test,predRFC))
print("Confusion Matrix" , confusion_matrix(y_test,predRFC))
print(classification_report(y_test,predRFC))

**Logistic Regression**

In [None]:
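# Checking accuracy for Logistic Regression
# max_iter is raised because the default of 100 iterations may stop before convergence on unscaled features
LR = LogisticRegression(max_iter=1000)
LR.fit(X_train,y_train)
predLR = LR.predict(X_test)
print(accuracy_score(y_test,predLR))
print("Confusion Matrix" , confusion_matrix(y_test,predLR))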
print(classification_report(y_test,predLR))

In [None]:
#LR Classifier

from sklearn.metrics import confusion_matrix
sns.heatmap(confusion_matrix(y_test,predLR), annot=True,cmap = 'Blues')
plt.show()

In [None]:
# Checking accuracy for SVC
svc  = SVC()
svc.fit(X_train,y_train)
predsvc = svc.predict(X_test)
print(accuracy_score(y_test,predsvc))
print("Confusion Matrix" , confusion_matrix(y_test,predsvc))
print(classification_report(y_test,predsvc))

In [None]:
#SVC Classifier

from sklearn.metrics import confusion_matrix
sns.heatmap(confusion_matrix(y_test,predsvc), annot=True,cmap = 'Blues')
plt.show()

In [None]:
# Checking accuracy for GradientBoostingClassifier
GB  = GradientBoostingClassifier()
GB.fit(X_train,y_train)
predGB = GB.predict(X_test)
print(accuracy_score(y_test,predGB))
print("Confusion Matrix" , confusion_matrix(y_test,predGB))
print(classification_report(y_test,predGB))

In [None]:
from sklearn.metrics import confusion_matrix
sns.heatmap(confusion_matrix(y_test,predGB), annot=True,cmap = 'Blues')
plt.show()

In [None]:
#AdaBoost Classifier
ABC  = AdaBoostClassifier()
ABC.fit(X_train,y_train)
predABC = ABC.predict(X_test)
print(accuracy_score(y_test,predABC))
print("Confusion Matrix" , confusion_matrix(y_test,predABC))
print(classification_report(y_test,predABC))

In [None]:
from sklearn.metrics import confusion_matrix
sns.heatmap(confusion_matrix(y_test,predABC), annot=True,cmap = 'Blues')
plt.show()
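
The cells above score each model on a single train/test split. The notebook imports `cross_val_score` and builds a scaling pipeline without using either, so a cross-validated comparison is one possible complement. The sketch below shows the idea on synthetic stand-in data generated with `make_classification`. The data, the choice of three models, the five stratified folds, and the names `X_demo`, `y_demo`, and `cv_results` are all assumptions for illustration, not results or settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the glass dataset): 214 rows, 9 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scale-sensitive models are wrapped in a pipeline so scaling is fitted inside each fold.
models = {
    "LogisticRegression": Pipeline([("scaler", StandardScaler()),
                                    ("clf", LogisticRegression(max_iter=1000))]),
    "SVC": Pipeline([("scaler", StandardScaler()), ("clf", SVC())]),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over folds gives a score that depends less on one particular `random_state` than a single split does. In the notebook, `X` and `y` would take the place of the synthetic arrays.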