
# Lab 4 - Machine Learning 1 - Classification

This lab will introduce the basic concepts behind machine learning
and the tools that allow us to learn from data.

Machine learning is one of the main topics in modern AI and is used
for many exciting applications that we will see in the coming weeks.

![scikit](https://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png)

# Review

So far we have covered two libraries. The first is Pandas (Week 2), which handles data frames.

In [None]:
import pandas as pd

The second is Altair (Week 3), which handles data visualization.

In [None]:
import altair as alt

We will continue building on these two libraries throughout the semester.

Let's start from a simple dataframe. We have mainly been loading
data frames from files, but we can also create them directly from Python dictionaries.

In [None]:
d = {
    "City" : ["New York", "Philadelphia", "Boston"],
    "Temperature" : [25.3, 30.1, 22.1],
    "Location" : [20.0, 15.0, 25.0],
    "Population" : [10000000, 1000000, 500000],
}
df = pd.DataFrame(d)
df

Unnamed: 0,City,Temperature,Location,Population
0,New York,25.3,20.0,10000000
1,Philadelphia,30.1,15.0,1000000
2,Boston,22.1,25.0,500000


Here are our columns.

In [None]:
df.columns

Index(['City', 'Temperature', 'Location', 'Population'], dtype='object')

We can make a graph by converting our dataframe into a `Chart`.
Remember we do this in three steps.

**Charting**

1. Chart - Convert a dataframe to a chart
2. Mark - Determine which type of chart we want
3. Encode - Say which columns correspond to which dimensions

One example is a bar chart.

In [None]:
chart = (alt.Chart(df)
           .mark_bar()
           .encode(x = "City",
                   y = "Population"))
chart

Notice that we didn't have to use all the columns, and it only showed the ones we specified.

Another example is a chart that shows the location and the temperature.

In [None]:
chart = (alt.Chart(df)
           .mark_point()
           .encode(x = "Location",
                   y = "Temperature"))
chart

The library allows us to add special features. For instance, we can add a "Tooltip" so that hovering the mouse over a point tells us which city it is.

In [None]:
chart = (alt.Chart(df)
           .mark_point(size=100, fill="yellow")
           .properties(title="Temperature By Location")
           .encode(x = "Location",
                   y = "Temperature",
                   shape = "City",
                   color = "City",
                   tooltip = ["City", "Temperature"]
           ))
chart

## Review Exercise

Make a bar chart that shows each city with its temperature and a tooltip of the city name.

In [None]:
#📝📝📝📝 FILLME
chart = (alt.Chart(df)
           .mark_bar()
           .properties(title="Temperature By City")
           .encode(x = "City",
                   y = "Temperature",
                   color = "City",
                   tooltip = ["City"]
           ))
chart

# Unit A

## Machine Learning Data

For today's class we are going to take a break from our climate
change data and work with a simplified set of starter data.

Our dataset is a Red versus Blue classification challenge. Let us take a look.

In [None]:
df = pd.read_csv("https://srush.github.io/BT-AI/notebooks/simple.csv")
df

Unnamed: 0,class,split,feature1,feature2
0,blue,train,0.368232,0.447353
1,blue,train,0.574324,0.382358
2,red,train,0.799023,0.849630
3,blue,train,0.778323,0.104591
4,red,train,0.824153,0.989757
...,...,...,...,...
145,blue,test,0.259535,0.122557
146,red,test,0.937820,0.249618
147,blue,test,0.148987,0.700891
148,blue,test,0.531986,0.439514


The first thing to do is to look at the columns.

In [None]:
df.columns

Index(['class', 'split', 'feature1', 'feature2'], dtype='object')

Here is what the data looks like.

In [None]:
chart = (alt.Chart(df)
    .mark_point()
    .encode(
        x = "feature1:Q",
        y = "feature2:Q",
        color = "class:N",
        shape = "split:N",
        tooltip = "class:N"
    ))
chart

First is `split`.

In [None]:
splits = df["split"].unique()
splits

array(['train', 'test'], dtype=object)

The two options here are `train` and `test`. This is an important
distinction in machine learning.

* Train -> Points that we use to fit our machine learning model.
* Test ->  Points that we use to predict with our machine learning model.

For example, if we were building a model for classifying types of
birds from images, our Train split might be pictures of birds from a
guide, whereas our Test split would be new pictures of birds in the
wild that we want to classify.

Let us separate these out using a filter.

In [None]:
# .copy() gives each split its own dataframe, so adding columns later is safe.
df_train = df.loc[df["split"] == "train"].copy()
df_test = df.loc[df["split"] == "test"].copy()

Next is `class`.

In [None]:
classes = df_train["class"].unique()
classes

array(['blue', 'red'], dtype=object)

We can see there are two options, `red` and `blue`.
This tells us the color associated with the point. For this exercise,
our goal is going to be splitting up these two colors.

Finally we have `features`.

In [None]:
features = df_train[["feature1", "feature2"]].describe()
features

Unnamed: 0,feature1,feature2
count,100.0,100.0
mean,0.508042,0.535118
std,0.269423,0.28528
min,0.005639,0.006402
25%,0.27492,0.296329
50%,0.568291,0.570836
75%,0.738461,0.81217
max,0.996007,0.990137


Features are the columns that we use
in order to solve the challenge. The machine learning model gets to
use the features in any way it wants in order to predict the class.

Let us now put everything together to draw a graph.

**Charting**

1. Chart - Just our training split.
2. Mark - Point mark to show each row
3. Encode - The features and the class.

In [None]:
chart = (alt.Chart(df_train)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class"
    ))
chart

We can see that for this example the colors are split into
two sides of the chart. Blue is at the bottom-left and red is
at the top-right.

We can also look at the test split. The test split consists of
the additional challenge points that our model needs to get correct.
These points will follow a similar pattern, but have different features.

In [None]:
chart = (alt.Chart(df_test)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class"
    ))
chart

## Prediction

We are interested in using the features to predict the class (red/blue).
We can do this by writing a function.

In [None]:
def predict(point):
    if point["feature1"] > 0.5:
        return "red"
    else:
        return "blue"

In [None]:
pt = {"feature1" : 0.3, "feature2" : 0.7}
predict(pt)

'blue'

We can apply this function using a variant of `map` from Module 1.
The `apply` command will call our prediction for each point in test.

In [None]:
df_test["predict"] = df_test.apply(predict, axis=1)


In [None]:
df_test

Unnamed: 0,class,split,feature1,feature2,predict
100,blue,test,0.718804,0.075953,red
101,red,test,0.441072,0.82645,blue
102,red,test,0.82035,0.274501,red
103,red,test,0.620815,0.643749,red
104,blue,test,0.363353,0.596951,blue
105,blue,test,0.325463,0.216925,blue
106,red,test,0.465097,0.660846,blue
107,blue,test,0.194541,0.687478,blue
108,blue,test,0.250895,0.610575,blue
109,red,test,0.621408,0.567207,red


Once we have made predictions, we can compute a score for how well
our prediction did. We do this by comparing the `predict` column with the `class` column.

In [None]:
correct = (df_test["predict"] ==  df_test["class"])
df_test["correct"] = correct


Let us see how well we did. This graph puts everything together.

In [None]:
chart = (alt.Chart(df_test)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class",
        fill = "predict",
        shape = "correct",
        tooltip = ["correct"]
    ))
chart

The outline of each point is blue or red based on its true class, whereas the fill shows
our prediction. Mousing over a point tells us whether the prediction is correct.

👩‍🎓**Student question: How well did our predictions do?**

In [None]:
#📝📝📝📝 FILLME
df_test["correct"].mean()

0.76
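
Taking the mean of a column of `True` / `False` values gives the fraction that are `True`, which is why the mean of `correct` is the accuracy. A small plain-Python sketch of the same calculation, using made-up labels rather than the lab data:

```python
# Hypothetical predictions and true classes for four points.
predicted = ["red", "blue", "red", "red"]
actual = ["blue", "red", "red", "red"]

# True where the prediction matches the true class.
correct = [p == a for p, a in zip(predicted, actual)]

# True counts as 1 and False as 0, so this is the fraction correct.
accuracy = sum(correct) / len(correct)
print(correct, accuracy)
```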

In [None]:
out = df_test.groupby(["correct"], as_index=False).count()
chart = (alt.Chart(out)
            .mark_bar()
            .encode(
                x="correct",
                y="class"
                
            ))
chart

# Group Exercise A

## Question 0

Who are other members of your group today?

In [None]:
#📝📝📝📝 FILLME
# Group members: Palak Shah, Sahar Sami and Aisha Bashir
dtypes = {
    "names": ["Aisha Bashir", "Palak Shah", "Sahar Sami"]
}

What is something they are great at cooking?

In [None]:
#📝📝📝📝 FILLME
dtypes = {
    "food": "indian food"
}

What are their favorite animals? 

In [None]:
#📝📝📝📝 FILLME
dtypes = {
    "favorite_animals": "labradors"
}

## Question 1

The `predict` function above is not able to fully separate the points into red/blue groups.
Can you write a new function that gets all of the points correct?

In [None]:
#📝📝📝📝 FILLME
def my_predict(point):
    # Predict red when the point lies above the line feature1 + feature2 = 1.
    if point["feature1"] + point["feature2"] > 1:
        return "red"
    return "blue"
df_test["predict"] = df_test.apply(my_predict, axis=1)


Redraw the graph above to show that you split up the points correctly.

In [None]:
#📝📝📝📝 FILLME
chart = (alt.Chart(df_test)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class",
        fill = "predict",
    ))
chart

## Question 2

The dataset above is a bit easy. It seems like you can just
separate the points with a line.

Next let us consider a harder example where the red and blue points form a circle.

In [None]:
df2 = pd.read_csv("https://srush.github.io/BT-AI/notebooks/circle.csv")

In [None]:
df2_train = df2.loc[df2["split"] == "train"].copy()
df2_test = df2.loc[df2["split"] == "test"].copy()

Draw a chart with these points.

In [None]:
#📝📝📝📝 FILLME
chart = (alt.Chart(df2)
      .mark_point()
      .encode(
          x = "feature1",
          y = "feature2",
          color = "class",
          shape = "split")
      )
chart

## Question 3

Try to write a function that separates the blue and the red
points. How well can you do?

In [None]:
#📝📝📝📝 FILLME
def my_circle_predict(point):
    # Squared distance from the guessed center (0.4, 0.4); red outside the circle.
    if (point["feature1"] - 0.4)**2 + (point["feature2"] - 0.4)**2 > 0.05:
        return "red"
    return "blue"
df2_test["predict"] = df2_test.apply(my_circle_predict, axis=1)


Redraw the graph above to show that you split up the points correctly.

In [None]:
#📝📝📝📝 FILLME
chart = (alt.Chart(df2_test)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class",
        fill = "predict",
    ))
chart

# Unit B

Machine learning (ML) is a collection of methods for automatically
finding a way to split up points. It sounds really simple, but
it turns out that splitting up points like this is really useful.

To explore this idea we are going to use the library Scikit-Learn.  This is 
a standard toolkit for machine learning in Python.

![sklearn](https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

One warning. The documentation for Scikit-Learn is a bit intimidating. If you look something
up it might appear like this.

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Do not be scared though. Most of these options do not matter so much in practice. You can
learn the important parts in 30 minutes.

Let us first import the library.

In [None]:
import sklearn.linear_model

We are going to use this formula for all our machine learning.

**Model Fitting**

1. Dataframe. Create your training data (This part you are an expert in!)
2. Fit. Create a model and give it training features
3. Predict. Use the model on test data.
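
To make the fit and predict steps concrete, here is a toy sketch of an object with the same two-method shape. This is not what Scikit-Learn does internally: `MidpointClassifier` is a hypothetical class invented for illustration, and its rule (a single threshold on the sum of the features) is a deliberately simple stand-in.

```python
class MidpointClassifier:
    """Toy stand-in for a model: learns one threshold on the sum of the features."""

    def fit(self, X, y):
        # Feature sums for the rows labeled True and the rows labeled False.
        pos = [sum(row) for row, label in zip(X, y) if label]
        neg = [sum(row) for row, label in zip(X, y) if not label]
        # The threshold is halfway between the two group averages.
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        # True when a row's feature sum is above the learned threshold.
        return [sum(row) > self.threshold for row in X]

# Made-up training data: features as rows, labels as True / False.
X_train = [[0.8, 0.9], [0.9, 0.7], [0.2, 0.3], [0.4, 0.1]]
y_train = [True, True, False, False]

toy_model = MidpointClassifier().fit(X_train, y_train)
toy_predictions = toy_model.predict([[0.7, 0.8], [0.1, 0.2]])
print(toy_predictions)
```

The point of the sketch is the pattern: `fit` looks only at training data and stores what it learned, and `predict` uses that stored result on new points.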

*Step 1.* Create our data. (We did this already).

In [None]:
df_train

Unnamed: 0,class,split,feature1,feature2
0,blue,train,0.368232,0.447353
1,blue,train,0.574324,0.382358
2,red,train,0.799023,0.849630
3,blue,train,0.778323,0.104591
4,red,train,0.824153,0.989757
...,...,...,...,...
95,blue,train,0.325493,0.517772
96,blue,train,0.328865,0.061240
97,red,train,0.792355,0.298344
98,red,train,0.850158,0.840475


*Step 2.* Create our model and fit it to data.

First we pick a model type. We will mostly use this one.

In [None]:
sklearn.linear_model.LogisticRegression

sklearn.linear_model._logistic.LogisticRegression

However I really hate the name *logistic regression*. So let us rename this function to what it really is.

In [None]:
LinearClassification = sklearn.linear_model.LogisticRegression
model = LinearClassification()

Then we tell it which features to use as input (X) and what its goal
is (y). Here we tell it to use `feature1` and `feature2` and to
predict whether the point is `red`.

In [None]:
model.fit(X=df_train[["feature1", "feature2"]],
          y=df_train["class"] == "red")

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

This is similar to an Altair chart. Just tell it which columns to use.

*Step 3*. Predict. Once we have a model we can use it to predict the
output classes of the test points. This replaces the part where we did it
manually.

In [None]:
df_test["predict"] = model.predict(df_test[["feature1", "feature2"]])
df_test["correct"] = df_test["predict"] == (df_test["class"] == "red")  # refresh for the tooltip


We can see the graph that came out.

In [None]:
chart = (alt.Chart(df_test)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class",
        fill = "predict",
        tooltip = ["correct"]
    ))
chart

That's it! You have just built your first machine learning model.

## Details

What happened? How did the system know whether the output points
would be red or blue?

The key idea is that behind the scenes the model uses the training data
to learn a class for every possible point.

For instance, suppose we make up a pair of feature values.

In [None]:
feature1 = 0.2
feature2 = 0.5

Our model will produce an output prediction.

In [None]:
prediction = model.predict([[feature1, feature2]])
prediction

array([False])

In fact, we can even see what the model would do for any point. This dataframe has all possible points.

In [None]:
all_df = pd.read_csv("https://srush.github.io/BT-AI/notebooks/all_points.csv")
chart = (alt.Chart(all_df)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
    ))
chart
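
The contents of `all_points.csv` are not shown here. Assuming it holds a regular grid of points covering the square from 0 to 1 in both features, a similar grid could be built along these lines. The step size of 0.1 and the name `grid` are illustrative choices, not taken from the file.

```python
import itertools

# Evenly spaced values 0.0, 0.1, ..., 1.0 along each axis (assumed spacing).
steps = [i / 10 for i in range(11)]

# Every combination of a feature1 value with a feature2 value.
grid = [{"feature1": x, "feature2": y}
        for x, y in itertools.product(steps, steps)]

print(len(grid), grid[0], grid[-1])
```

A list of dictionaries like this can be passed to `pd.DataFrame` to get a dataframe with `feature1` and `feature2` columns.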

Let us see what our model would do on each of them.

In [None]:
all_df["predict"] = model.predict(all_df[["feature1", "feature2"]])

In [None]:
chart = (alt.Chart(all_df)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color="predict",
        fill = "predict",
    ))
chart

This makes sense as it corresponds to the same line that we saw on our original chart.

In [None]:
chart2 = (alt.Chart(df_test)
    .mark_point(color="black")
    .encode(
        x = "feature1",
        y = "feature2",
        shape = "class",
    ))
chart = chart + chart2
chart

## Other Data

So is machine learning magic? Can we just give any data
and have it learn a separator for us?

Well let's try the circle dataset.

In [None]:
chart = (alt.Chart(df2_train)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color = "class",
    ))
chart

First we fit.

In [None]:
model.fit(X=df2_train[["feature1", "feature2"]],
          y=df2_train["class"] == "red")

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Then we predict.

In [None]:
df2_test["predict"] = model.predict(df2_test[["feature1", "feature2"]])


In [None]:
df2_test

Unnamed: 0,class,split,feature1,feature2,predict
500,blue,test,0.594712,0.418443,True
501,red,test,0.102121,0.847092,True
502,red,test,0.082418,0.47496,True
503,red,test,0.19724,0.780904,True
504,red,test,0.789593,0.297249,True
505,blue,test,0.462743,0.228536,True
506,red,test,0.126399,0.991937,True
507,blue,test,0.391861,0.713452,True
508,blue,test,0.633829,0.3648,True
509,blue,test,0.382835,0.433166,True


Finally we graph.

In [None]:
all_df["predict"] = model.predict(all_df[["feature1", "feature2"]])

In [None]:
chart = (alt.Chart(all_df)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color="predict",
        fill = "predict",
    ))
chart

Unfortunately this result is no good. The model did not learn about the circle.
In fact it learned something completely wrong.

We can debug the problem by looking at how we created our model.

This line of code says to create a `Linear` model. Linear in this case
implies that the model can only use a line to split the points.

In [None]:
model = LinearClassification()

This model couldn't even learn about the circle if it wanted to.
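
Here is a small sketch of why a line cannot carve out a circle. A linear classifier scores a point with a weighted sum of its features and predicts red when the score is above zero. The weights below are hypothetical values chosen for illustration, not the ones our model learned.

```python
def score(point, w1, w2, b):
    # Weighted sum of the two features plus an offset.
    return w1 * point[0] + w2 * point[1] + b

def linear_predict(point, w1, w2, b):
    # Red (True) when the score is above zero.
    return score(point, w1, w2, b) > 0

# Hypothetical weights.
w1, w2, b = 0.5, 0.5, 0.2

# Two points on opposite sides of the square, and the point halfway between them.
corner_a = (0.1, 0.1)
corner_b = (0.9, 0.9)
middle = (0.5, 0.5)

for point in [corner_a, corner_b, middle]:
    print(point, score(point, w1, w2, b), linear_predict(point, w1, w2, b))
```

The score of the middle point is always the average of the two corner scores, whatever weights we pick. So if both corners are predicted red, the middle must be predicted red too, and a blue region surrounded by red is out of reach for this kind of model.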

In the group exercise you will explore some other possible models that can get around some of these limitations.

# Group Exercise B

## Question 1

The linear model we used above could only draw lines to separate
red and blue. Let us consider a new model.

In [None]:
import sklearn.neighbors
neighbor_model = sklearn.neighbors.KNeighborsClassifier(1)

The neighbor model takes a different approach. Instead of
producing a line, it memorizes all the points in training and
predicts based on how close a test example is.
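
The idea can be sketched in a few lines of plain Python. This is a simplified illustration with made-up training points, assuming straight-line (Euclidean) distance and a single neighbor. The function name `nearest_neighbor_predict` is hypothetical, and the real library is more general and far more efficient.

```python
import math

def nearest_neighbor_predict(train_points, train_labels, point):
    # Index of the memorized training point closest to the new point.
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], point))
    # Copy the label of that closest training point.
    return train_labels[best]

# Made-up training data: blue in the middle, red near the corners.
train_points = [(0.5, 0.5), (0.45, 0.55), (0.1, 0.1), (0.9, 0.9)]
train_labels = ["blue", "blue", "red", "red"]

near_center = nearest_neighbor_predict(train_points, train_labels, (0.52, 0.48))
near_corner = nearest_neighbor_predict(train_points, train_labels, (0.05, 0.2))
print(near_center, near_corner)
```

Because the prediction only depends on which stored point is closest, the boundary between colors can bend into any shape the training points trace out.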

For this question, you should:

1. Fit the neighbor model to the circle data.
2. Predict on `all_df`.
3. Graph the resulting shape.

In [None]:
#📝📝📝📝 FILLME
neighbor_model.fit(X=df2_train[["feature1", "feature2"]],
          y=df2_train["class"] == "red")
all_df["predict"] = neighbor_model.predict(all_df[["feature1", "feature2"]])
chart = (alt.Chart(all_df)
    .mark_point()
    .encode(
        x = "feature1",
        y = "feature2",
        color="predict",
        fill = "predict",
    ))
chart

It will not be perfect but it should be much closer to the circle shape of the data.

## Question 2

So far all of our datasets have had 2 features. For this dataset there are three
features (`feature1`, `feature2`, `feature3`).

In [None]:
df3 = pd.read_csv("https://srush.github.io/BT-AI/notebooks/three.csv")

Split the dataset into train and test, and then fit the linear model
`model` to all three of these features.

In [None]:
#📝📝📝📝 FILLME
df3_train = df3.loc[df3["split"] == "train"].copy()
df3_test = df3.loc[df3["split"] == "test"].copy()
model3 = LinearClassification()
# The features argument of fit is spelled with a capital X.
model3.fit(X=df3_train[["feature1", "feature2", "feature3"]],
           y=df3_train["class"] == "red")

How many points in test does the model get correct?

In [None]:
#📝📝📝📝 FILLME
df3_test["predict"] = model3.predict(df3_test[["feature1", "feature2", "feature3"]])
(df3_test["predict"] == (df3_test["class"] == "red")).sum()  # number of correct test points

## Question 3

It turns out that for `df3` you only need two of the features to
achieve high accuracy. Make a graph for each pair of features (three
graphs total).

In [None]:
#📝📝📝📝 FILLME
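# One scatter plot per pair of features, colored by the true class.
def pair_chart(x, y):
    return (alt.Chart(df3)
        .mark_point()
        .encode(x = x, y = y, color = "class"))
(pair_chart("feature1", "feature2")
    | pair_chart("feature1", "feature3")
    | pair_chart("feature2", "feature3"))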

Which are the two features that you need? Try fitting `model` to just those two.

In [None]:
#📝📝📝📝 FILLME
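# Fit `model` on each pair of features and print its test accuracy;
# the pair with the highest accuracy is the one we need.
is_red = df3_test["class"] == "red"
for pair in [["feature1", "feature2"], ["feature1", "feature3"], ["feature2", "feature3"]]:
    model.fit(X=df3_train[pair], y=df3_train["class"] == "red")
    print(pair, (model.predict(df3_test[pair]) == is_red).mean())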

