# AI Project
***
## Administration and Rules:

Greetings! This is the first project in our project series. Generally, projects are larger problem sets that contain more open-ended questions. You will be given some general guidelines and are free to explore beyond them.

* **Guidelines:**
    * This project is divided into 5 parts.
        * For each part, you will be given a separate deadline.
        * Once the deadline for a part has passed, a sample analysis of that part will be released as your starting point for the following parts. (Think of these as checkpoints in a video game.)
        * So, start early!

    
* **How to submit:**
    * Submit on Google Classroom

***
## Introduction
In this project, you will be using a dataset called "Ghouls Goblins and Ghosts".

### Context
After a month of making scientific observations and taking careful measurements, we’ve determined that 900 ghouls, ghosts, and goblins are infesting our halls and frightening our fellow teachers and students here at Pinghe School. When trying garlic, asking politely, and using reverse psychology didn't work, it became clear that machine learning is the only answer to banishing our unwanted guests.

![halloween-660x.png](https://i.loli.net/2020/05/24/WeOE4TFNJp3DQnm.png)

So now the hour has come to put the data we’ve collected in your hands. We’ve managed to identify 371 of the ghastly creatures, but need your help to vanquish the rest. And only an accurate classification algorithm can thwart them. Use bone length measurements, severity of rot, extent of soullessness, and other characteristics to distinguish (and extinguish) the intruders. Are you ghost-busters up for the challenge?

### Dataset
Dataset file:
* Train: https://drive.google.com/file/d/1A7kgIjEruZv3qWbNl5kWzP5xRQbZmou5/view?usp=sharing
* Test: https://drive.google.com/file/d/1UhYjXWwH5L4BzUPWB9Iq5wO05_2SyMdm/view?usp=sharing

File descriptions
* `train.csv` - the training set, which contains both features and labels (target variables)
* `test.csv` - the test set, which contains only features; your job is to predict the types

Data fields
* id - id of the creature
* bone_length - average length of bone in the creature, normalized between 0 and 1
* rotting_flesh - percentage of rotting flesh in the creature
* hair_length - average hair length, normalized between 0 and 1
* has_soul - percentage of soul in the creature
* color - dominant color of the creature: 'white','black','clear','blue','green','blood'
* type - target variable: 'Ghost', 'Goblin', and 'Ghoul'

### Goal
Predict the types of the spooky creatures!

***
## Project Milestone 0 - Starter

Make sure you can run this starter

In [2]:
# Import libraries
import numpy as np
import pandas as pd

In Google Colab, loading data works a little differently: we first need to mount the drive.

In [3]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Save a shortcut copy of the dataset onto your own drive
* Data: https://drive.google.com/file/d/1A7kgIjEruZv3qWbNl5kWzP5xRQbZmou5/view?usp=sharing

In [4]:
# Load dataset

df = pd.read_csv('/content/drive/MyDrive/train.csv')
df.head()

| | id | bone_length | rotting_flesh | hair_length | has_soul | color | type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.354512 | 0.350839 | 0.465761 | 0.781142 | clear | Ghoul |
| 1 | 1 | 0.57556 | 0.425868 | 0.531401 | 0.439899 | green | Goblin |
| 2 | 2 | 0.467875 | 0.35433 | 0.811616 | 0.791225 | black | Ghoul |
| 3 | 4 | 0.776652 | 0.508723 | 0.636766 | 0.884464 | black | Ghoul |
| 4 | 5 | 0.566117 | 0.875862 | 0.418594 | 0.636438 | green | Ghost |


## Congratulations on successfully running the Project Starter!

***
## Project Milestone #1 (31 points)


This is the first portion of the project. You are asked to explore the dataset using the tools we have learned in class so far.

### Question 1.1 (7 points)
Get started:
* Import the libraries **(2 points)**
* Import the data: both the training set and the test set **(2 points)**
* Take a quick look at the dataset. Are there any missing values? **(3 points)**

In [5]:
# Import the libraries, expand the list as needed
import numpy as np
import pandas as pd


#### Import the training set **(1 point)**
Dataset file:
* Train: https://drive.google.com/file/d/1A7kgIjEruZv3qWbNl5kWzP5xRQbZmou5/view?usp=sharing

In [6]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
# Read in the training set
df = pd.read_csv('/content/drive/MyDrive/train.csv')


# print the shape of the training set
print(df.shape)


# show the head of the training set
df.head()

(371, 7)


| | id | bone_length | rotting_flesh | hair_length | has_soul | color | type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.354512 | 0.350839 | 0.465761 | 0.781142 | clear | Ghoul |
| 1 | 1 | 0.57556 | 0.425868 | 0.531401 | 0.439899 | green | Goblin |
| 2 | 2 | 0.467875 | 0.35433 | 0.811616 | 0.791225 | black | Ghoul |
| 3 | 4 | 0.776652 | 0.508723 | 0.636766 | 0.884464 | black | Ghoul |
| 4 | 5 | 0.566117 | 0.875862 | 0.418594 | 0.636438 | green | Ghost |


In [11]:
# Let's check if there are any missing values (True if any cell is NaN)
df.isna().any().any()

False

**Answer:** Our dataset does not have any missing (null/NaN) values. We are good to proceed.
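
When a dataset does contain gaps, a per-column count is usually more informative than a single boolean. The following is a minimal sketch on a small made-up DataFrame (`toy` is hypothetical and is not the project data), showing how `isna()` combines with `sum()` and `any()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the project dataset) with one missing value
toy = pd.DataFrame({
    "bone_length": [0.35, 0.58, np.nan],
    "color": ["clear", "green", "black"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single True/False answer: is anything missing at all?
has_missing = bool(toy.isna().any().any())
print(has_missing)
```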

### Question 1.2 **(12 points)**
* What types of data are the features? Which are Quantitative and which are Qualitative? **(2 points)**
* For the qualitative feature(s), perform one-hot encoding to transform each into separate columns, one per category. Add these new columns to your `DataFrame` as new features. **(6 points)**
* Drop features that you think are either irrelevant or redundant, and store features and types in separate variables (for both the train and test sets), so 3 variables in total (you do not observe targets for the test set). **(4 points)**

#### What types of data are the features? Which are Quantitative and which are Qualitative? **(2 points)**

**Answer:** From our previous exploration, the `color` variable is categorical (qualitative), while the other features are all quantitative. Next, let's encode the `color` feature before visualization. As a first step, we use a `LabelEncoder` to change the colors from text to the integers 0, 1, 2, 3, 4, 5.

#### For the qualitative feature(s), perform one-hot encoding to transform each into separate columns, one per category. Add these new columns to your `DataFrame` as new features. **(6 points)**

In [13]:
# Label-encode the color variable (integer codes 0-5; this is not yet one-hot encoding)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
enc = le.fit(df['color'])
df['color_encode'] = enc.transform(df['color'])


In [None]:
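# use pd.concat to get augmented DataFrame (the 'color' prefix is an arbitrary choice)
color_dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, color_dummies], axis=1)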

**Answer:** We have finished one-hot encoding the `color` variable.
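
Label encoding and one-hot encoding are easy to confuse, so here is a small sketch contrasting the two on made-up data (`toy` is hypothetical, and the `color` prefix is just one naming choice). Label encoding assigns each category an integer, which implies an order that colors do not have; one-hot encoding creates one indicator column per category instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (not the project dataset)
toy = pd.DataFrame({"color": ["clear", "green", "black", "green"]})

# Label encoding: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(toy["color"])
print(label_codes)

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy["color"], prefix="color")
print(one_hot)
```

Each row of `one_hot` has exactly one active column, which is why this representation is usually preferred for nominal features in linear models.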

#### Drop features that you think are either irrelevant or redundant, and store features and types in separate variables, so you should have two variables ready: `train_features` and `train_target`. **(4 points)**

In [None]:
# First define your target variable y




Your training features `train_features` should look like this
![image.png](https://i.loli.net/2021/04/29/PjbAqUVOzvcRDX2.png)

In [None]:
# Let's drop the 'id' and 'color' columns, as they don't help our classification



**Answer:**
* We have finished processing our feature variables for the training set.
* We have also stored training targets into a separate variable.
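
As an illustration of the features/target separation, here is a minimal sketch on made-up data (`toy`, `toy_features`, and `toy_target` are hypothetical names, and which columns to drop in the real dataset is your own judgment call):

```python
import pandas as pd

# Hypothetical toy data (not the project dataset)
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "bone_length": [0.35, 0.58, 0.47],
    "color": ["clear", "green", "black"],
    "type": ["Ghoul", "Goblin", "Ghoul"],
})

# The target is the column we want to predict
toy_target = toy["type"]

# The features are everything else, minus columns we judge unhelpful
toy_features = toy.drop(columns=["id", "color", "type"])

print(toy_features.columns.tolist())
print(toy_target.tolist())
```

Note that `drop` returns a new DataFrame by default, so `toy` itself is left unchanged.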

### Question 1.3 **(12 points)**
In this part of the project, let's treat this as separate binary classification problems first. Suppose that now we are only interested in identifying all the `Ghosts`. Then effectively, we can think of the problem as having just two classes: Ghosts and Non-Ghosts.

Repeating this process for each type effectively gives us the so-called one-vs-all method for multi-class classification.
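
To make the one-vs-all idea concrete, the sketch below builds one binary target per class from a handful of made-up labels (`types` and `binary_targets` are hypothetical names, not part of the project data):

```python
import pandas as pd

# Hypothetical toy labels (not the project dataset)
types = pd.Series(["Ghoul", "Goblin", "Ghost", "Ghost", "Ghoul"], name="type")

# One binary target per class: 1 for that class, 0 for everything else
binary_targets = {
    label: (types == label).astype(int)
    for label in ["Ghost", "Goblin", "Ghoul"]
}

print(binary_targets["Ghost"].tolist())
```

Every creature is the positive case in exactly one of the three binary problems.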


Let's first perform a Logistic Regression for the `Ghost` vs `Other` problem. Follow the steps:
* Transform your target variables into 0s and 1s - 1 for `Ghost`, 0 for `Other` **(4 Points)**
* Run a Logistic Regression **(8 Points)**
    * Pick accuracy as the model evaluation metric.
    * Construct a validation set or method, and explain why you picked it.
    * Calculate and output the evaluation metric on the validation set / method.
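
A single hold-out split is one validation method; k-fold cross-validation is a common alternative when the dataset is small, since every row is used for validation exactly once. The sketch below is only an illustration on synthetic data from `make_classification` (not the project dataset), and the choice of 5 folds is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the project dataset)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```

The mean of the fold scores gives a less noisy accuracy estimate than one split, at the cost of fitting the model several times.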

#### Transform your target variables into 0s and 1s - 1 for `Ghost`, 0 for `Other` **(4 Points)**

Make a new target variable called `train_target_ghost`, where
* 1: if type = 'Ghost'
* 0: otherwise

**Hint: you can use a selection method in pandas to achieve this.**

Your `train_target_ghost` should look like this:

![image.png](https://i.loli.net/2021/04/29/gAhWN1rVEToxkDX.png)

In [None]:
# Next we need to process the target variables.
# Note that we now only care about whether the creature is a Ghost or not - Binary Classification
# Hint: you can use a selection method in pandas to achieve this



**Answer:** We have transformed the targets into 0 and 1, with `Ghost` being the positive case.

#### Split the training set into training and validation sets using the `train_test_split` method

You should have 4 variables ready:
* X_train
* X_valid
* y_train_ghost
* y_valid_ghost
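
As a rough sketch of how such a split behaves, the example below applies scikit-learn's `train_test_split` to synthetic data (not the project dataset). The 20% validation size, the `random_state` value, and the use of `stratify` are illustrative assumptions, not requirements of the assignment:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the project dataset): 100 rows, 4 features
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([1] * 30 + [0] * 70)

# Hold out 20% for validation; stratify keeps the class ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_tr.shape, X_va.shape)
print(y_tr.mean(), y_va.mean())
```

Stratifying matters most when one class is much rarer than the others, as can happen in a one-vs-all setup.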

In [None]:
# split the dataset 



#### Run a Logistic Regression **(8 Points)**
* Pick accuracy as the model evaluation metric.
* Calculate and output the evaluation metric on the validation set / method.
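
The fit, predict, and score steps can be sketched as follows. This is an illustration on synthetic data (not the project dataset), with default `LogisticRegression` settings and an arbitrary split as assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the project dataset)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the training part only
model = LogisticRegression()
model.fit(X_tr, y_tr)

# Predict on the held-out part and compare with the true labels
y_pred = model.predict(X_va)
val_accuracy = accuracy_score(y_va, y_pred)
print(val_accuracy)
```

Accuracy here is simply the fraction of validation rows whose predicted label matches the true label.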

In [None]:
# Run a standard logistic regression



# Make model predictions on the validation set



# Let's pick accuracy as our metric
# calculate accuracy on the validation data

