<a href="https://colab.research.google.com/github/grmartinez09/CCS8-Rep-MARTINEZ/blob/main/CC19_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Phase 2**

## Data loading

### Subtask:
Load the "Raye-of-Sunshine-Dataset(Original).csv" file into a pandas DataFrame.


In [None]:
import pandas as pd

df = pd.read_csv('/content/Raye-of-Sunshine-DatasetOriginal.csv')
display(df.head())

FileNotFoundError: [Errno 2] No such file or directory: '/content/Raye-of-Sunshine-DatasetOriginal.csv'

In [None]:
from google.colab import drive
drive.mount('/content/drive')
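
The `FileNotFoundError` above means that no file exists at the path passed to `pd.read_csv`. A minimal sketch of a more defensive load follows. The helper name `first_existing_path` and the candidate paths in the usage comment are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def first_existing_path(candidates):
    """Return the first candidate path that exists on disk (hypothetical helper)."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    raise FileNotFoundError(f"None of these paths exist: {list(candidates)}")


# Usage (assumed locations; adjust to wherever the CSV was actually uploaded):
# import pandas as pd
# csv_path = first_existing_path([
#     '/content/Raye-of-Sunshine-Dataset(Original).csv',
#     '/content/Raye-of-Sunshine-DatasetOriginal.csv',
# ])
# df = pd.read_csv(csv_path)
```

Checking the path first gives a single error message that lists every location that was tried, which is easier to debug than a failure inside `read_csv`.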

## Data exploration

### Subtask:
Explore the loaded DataFrame (`df`) to understand the structure of the data.


In [None]:
# Check the shape of the DataFrame
print("DataFrame Shape:", df.shape)

# Get a summary of the DataFrame, including data types and non-null values
print("\nDataFrame Info:")
df.info()

# Calculate basic descriptive statistics for numerical features
print("\nDescriptive Statistics for Numerical Features:")
display(df.describe())

## Data cleaning

### Subtask:
Handle missing values in the DataFrame `df`.


Identify columns with missing values and replace them with the median for numerical features and mode for categorical features. Then remove rows with remaining missing values.



In [None]:
for column in df.columns:
  if df[column].isnull().any():
    if pd.api.types.is_numeric_dtype(df[column]):
      df[column].fillna(df[column].median(), inplace=True)
    else:
      df[column].fillna(df[column].mode()[0], inplace=True)

# Remove rows with any remaining missing values
df.dropna(inplace=True)

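The previous code block triggers a pandas warning because `fillna(..., inplace=True)` is called on a column selection (chained assignment). Assigning the filled column back to `df[column]` avoids the warning.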



In [None]:
for column in df.columns:
  if df[column].isnull().any():
    if pd.api.types.is_numeric_dtype(df[column]):
      df[column] = df[column].fillna(df[column].median())
    else:
      df[column] = df[column].fillna(df[column].mode()[0])

# Remove rows with any remaining missing values
df.dropna(inplace=True)

## Data cleaning

### Subtask:
Identify and remove duplicate rows from the DataFrame `df`.


Identify and remove duplicate rows from the DataFrame `df` using the `.duplicated()` and `.drop_duplicates()` methods.



In [None]:
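# Identify duplicate rows (every repeat after the first occurrence)
duplicate_rows = df[df.duplicated()]
print(f"Number of duplicate rows found: {len(duplicate_rows)}")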

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Print the number of rows remaining
print(f"Number of rows remaining after removing duplicates: {df.shape[0]}")

## Data cleaning

### Subtask:
Identify and handle outliers in the DataFrame `df`.


Identify outliers in numerical features using box plots and Z-scores.



In [None]:
import matplotlib.pyplot as plt
from scipy import stats

numerical_features = ['cap-diameter', 'stem-height', 'stem-width']

for feature in numerical_features:
  # Create box plot to visualize outliers
  plt.figure(figsize=(8, 6))
  plt.boxplot(df[feature].dropna())  # Drop NA values for boxplot
  plt.title(f'Box Plot of {feature}')
  plt.ylabel(feature)
  plt.show()

  # Calculate Z-scores and flag values more than 3 standard deviations from the mean
  z_scores = stats.zscore(df[feature].dropna())
  outlier_mask = abs(z_scores) > 3
  print(f"Number of outliers in {feature} (|Z-score| > 3): {outlier_mask.sum()}")
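
Box plots mark points as outliers with a different rule than the Z-score: by default matplotlib draws the whiskers 1.5 times the interquartile range (IQR) beyond the quartiles and plots anything further out as an individual point. The sketch below applies that rule numerically to toy data. The helper `iqr_outlier_mask` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data standing in for one numerical column
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
mask = iqr_outlier_mask(toy)
print("Outliers by the 1.5 * IQR rule:", toy[mask].tolist())
```

Because the two rules use different definitions, the Z-score count and the number of points drawn beyond the whiskers will generally not match.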

Handle outliers by capping values above the 95th percentile at that percentile. Only the upper tail is clipped; low values are left unchanged.



In [None]:
for feature in numerical_features:
  upper_limit = df[feature].quantile(0.95)
  df[feature] = df[feature].clip(upper=upper_limit)

## Data wrangling

### Subtask:
Transform categorical features in the DataFrame `df` into numerical representations.


Identify the categorical features (columns of dtype `object`) and apply label encoding to each of them.



In [None]:
from sklearn.preprocessing import LabelEncoder

# Identify categorical features (columns stored as strings)
categorical_features = df.select_dtypes(include=['object']).columns

# Label-encode every categorical feature, mapping each category to an integer code
for feature in categorical_features:
  le = LabelEncoder()
  df[feature] = le.fit_transform(df[feature])
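
The cell above label-encodes every categorical column. For comparison, a sketch of choosing the encoding by the number of unique values is shown below, run on toy data. The helper `encode_by_cardinality`, the threshold of 10, and the toy columns are assumptions for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_by_cardinality(frame, max_onehot_levels=10):
    """One-hot encode low-cardinality object columns, label-encode the rest."""
    frame = frame.copy()
    categorical = frame.select_dtypes(include=['object']).columns
    onehot_cols = [c for c in categorical if frame[c].nunique() <= max_onehot_levels]
    label_cols = [c for c in categorical if c not in onehot_cols]

    for col in label_cols:
        frame[col] = LabelEncoder().fit_transform(frame[col])
    if onehot_cols:
        frame = pd.get_dummies(frame, columns=onehot_cols)
    return frame


toy = pd.DataFrame({
    'color': ['r', 'g', 'b'] * 4,           # 3 levels -> one-hot
    'code': [f'k{i}' for i in range(12)],   # 12 levels -> label encoding
}, dtype=object)
encoded = encode_by_cardinality(toy)
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category; label encoding keeps a single column but assigns arbitrary integer codes.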

## Data preparation

### Subtask:
Normalize or standardize the numerical features in the DataFrame `df`.


Identify the numerical features and apply Min-Max scaling to them.



In [None]:
from sklearn.preprocessing import MinMaxScaler

# Identify the continuous numerical features (label-encoded categorical columns are left unscaled)
numerical_features = ['cap-diameter', 'stem-height', 'stem-width']

# Apply Min-Max scaling to numerical features
scaler = MinMaxScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])
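
Min-Max scaling maps each column to [0, 1] with x' = (x - min) / (max - min), computed per column. The sketch below checks that formula against `MinMaxScaler` on toy values; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({
    'cap-diameter': [2.0, 4.0, 6.0, 10.0],
    'stem-height': [1.0, 3.0, 5.0, 9.0],
})

# Manual Min-Max scaling: subtract the column minimum, divide by the column range
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual.to_numpy(), scaled))
```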

### 1. Q&A
* **How many rows are in the dataset after removing duplicates?** After removing duplicates, there are 56491 rows remaining in the dataset.
* **How many outliers were found in 'cap-diameter' using a Z-score threshold of 3?** 472 outliers were found in 'cap-diameter' using a Z-score threshold of 3.
* **What encoding method was used for categorical features with more than 10 unique values?** Label encoding. The encoding step applies it to every categorical feature, regardless of the number of unique values.
* **What scaling technique was used for numerical features?** Min-Max scaling was used to normalize numerical features.

### 2. Data Analysis Key Findings
* **Missing Values:** The original dataset contained a significant number of missing values, particularly in columns like `stem-root`, `veil-type`, `spore-print-color`, and `gill-spacing`. These missing values were handled by imputing the median for numerical features and the mode for categorical features. Rows with any remaining missing values were dropped.
* **Duplicate Rows:** The dataset contained duplicate rows, which were identified and removed.
* **Outliers:** Several outliers were identified in the numerical features `cap-diameter`, `stem-height`, and `stem-width`. These outliers were handled by capping values at the 95th percentile.
* **Categorical Feature Transformation:** Categorical features were transformed into numerical representations using label encoding. The encoding step applies `LabelEncoder` to every categorical column; no one-hot encoding was performed.
* **Numerical Feature Normalization:** Numerical features were normalized using Min-Max scaling to a range of [0, 1].


### 3. Insights or Next Steps
* **Further Feature Engineering:** Explore creating new features or interactions between existing features to improve model performance.
* **Model Building:** Proceed with model training and evaluation using the preprocessed dataset. Consider different machine learning algorithms suitable for the problem type.


# **Phase 3**

## Step 1: Model Selection

1.   Understand the Problem
---
*   Problem Type: Classification
      *(Mushroom Classification)*
*   Target Variable: gill-attachment
*   Evaluation Metrics: Accuracy, Precision, Recall, F1-score




2.   Choose Candidate Models

*   Logistic Regression: A simple and interpretable model that's often used as a baseline for classification tasks. It is suitable for binary and multi-class classification problems and efficient for datasets with a moderate number of features.




## Step 2: Model Training

Train-Test Split:

*   Split the preprocessed dataset into training and testing sets using the `train_test_split` function from scikit-learn.
*   Common split ratios are 80/20 and 70/30 (training/testing).



In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('gill-attachment', axis=1)  # Features: every column except the target
y = df['gill-attachment']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Adjust test_size and random_state as needed

Train the Model:

*   Import the desired model from scikit-learn.
*   Create an instance of the model.
*   Train the model using the training data (`X_train`, `y_train`).

In [None]:
from sklearn.linear_model import LogisticRegression
import pickle
import joblib

# max_iter is raised above the default of 100 so the solver has room to converge
model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)

# Save the trained model in both .pkl and .joblib formats
pklfilename = 'trained_model.pkl'
with open(pklfilename, 'wb') as f:  # the context manager closes the file handle
  pickle.dump(model, f)

jlfilename = 'trained_model.joblib'
joblib.dump(model, jlfilename)
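
A saved model is only useful if it can be restored later. The sketch below shows a round trip with `joblib.dump` and `joblib.load` on a tiny synthetic dataset. The toy data, the temporary directory, and the variable names are assumptions for illustration; in the notebook the same `joblib.load` call would be pointed at `trained_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset standing in for X_train / y_train
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression(max_iter=3000).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'trained_model.joblib')
    joblib.dump(toy_model, path)
    restored = joblib.load(path)  # returns the deserialized estimator

same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print("Restored model reproduces the predictions:", same)
```

Pickle and joblib files should only be loaded from trusted sources, since loading them can execute arbitrary code.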

## Step 3: Model Evaluation

Generate Predictions:
*   Use the trained model to make predictions on the test set.

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Show the first five predictions
print("Sample Predictions:", y_pred[:5])

Evaluate Using Metrics:

*   Accuracy: Proportion of correct predictions.
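*   Precision, Recall, and F1-Score: These metrics give a more detailed evaluation than accuracy, especially when classes are imbalanced.
*   Confusion Matrix: Shows, for each true class, how many test samples were predicted as each class. In the binary case these counts are the true positives, true negatives, false positives, and false negatives.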
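
To make these metrics concrete, the sketch below builds a confusion matrix for a made-up three-class example and derives the weighted precision from it by hand. The labels are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Per-class precision: correct predictions (diagonal) / all predictions of that class (column sum)
per_class_precision = np.diag(cm_toy) / cm_toy.sum(axis=0)

# 'weighted' averaging weights each class by its support (row sum = number of true samples)
support = cm_toy.sum(axis=1)
weighted_precision = np.average(per_class_precision, weights=support)

print(cm_toy)
print("Weighted precision from the matrix:", weighted_precision)
print("precision_score(average='weighted'):",
      precision_score(y_true_toy, y_pred_toy, average='weighted'))
```

With `average='macro'` every class would count equally instead of being weighted by its support, which matters when some classes are rare.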




In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(y_test, y_pred, average='weighted')  # Choose 'weighted', 'macro', or 'micro' based on your needs
print("Precision:", precision)

# Calculate recall
recall = recall_score(y_test, y_pred, average='weighted')  # Choose 'weighted', 'macro', or 'micro' based on your needs
print("Recall:", recall)

# Calculate F1-score
f1 = f1_score(y_test, y_pred, average='weighted')  # Choose 'weighted', 'macro', or 'micro' based on your needs
print("F1-Score:", f1)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
cm

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize Accuracy (uses the metrics computed in the previous cell)
plt.figure(figsize=(6, 4))
plt.bar(['Accuracy'], [accuracy])
plt.title('Model Accuracy')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

# Visualize Precision, Recall, and F1-Score
metrics = [precision,
           recall,
           f1]
labels = ['Precision', 'Recall', 'F1-Score']
plt.figure(figsize=(8, 6))
plt.bar(labels, metrics)
plt.title('Precision, Recall, and F1-Score')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

# Visualize Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Cross-Validation (optional):

*   Action: Use k-fold cross-validation to ensure the model performs consistently across different subsets of the data.

In [None]:
from sklearn.model_selection import cross_val_score

# Perform k-fold cross-validation (e.g., 5-fold)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # You can choose different scoring metrics
cv_results_df = pd.DataFrame({'Fold': range(1, len(cv_scores) + 1), 'Accuracy': cv_scores})

# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)

# Calculate the mean and standard deviation of the scores
print("Mean CV Accuracy:", cv_scores.mean())
print("Std Dev CV Accuracy:", cv_scores.std())
cv_results_df.to_csv('cross_validation_results.csv', index=False)



## Step 4: Model Improvement

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {
    'C': [0.1, 1, 10],  # Inverse regularization strength for Logistic Regression
    'penalty': ['l1', 'l2'],  # 'l1' needs a solver that supports it, e.g. 'liblinear' or 'saga'
}
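
The grid above can be passed to `GridSearchCV` to search over these hyperparameters. The following is a minimal sketch on synthetic data, not the notebook's final tuning step. It assumes a scikit-learn release that accepts the `penalty` argument, and the `saga` solver, the 3-fold setting, and the toy dataset are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
}

# 'saga' is chosen here because it supports both penalties in the grid
base_model = LogisticRegression(solver='saga', max_iter=5000)

grid_search = GridSearchCV(base_model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_toy, y_toy)

print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays, and `grid_search.best_estimator_` would then be evaluated on the held-out test set.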