# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [None]:
# Import your libraries:

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In this lab, we will explore a dataset that describes websites with different features and labels them as either benign or malicious. We will use supervised learning algorithms to figure out which feature patterns malicious websites are likely to have, and use our model to predict malicious websites.

# Challenge 1 - Explore The Dataset

Let's start by exploring the dataset. First load the data file:

In [None]:
websites = pd.read_csv('../website.csv')

#### Explore the data from a bird's-eye view.

You should already be very familiar with the procedure by now, so we won't provide step-by-step instructions. Reflect on what you did in the previous labs and explore the dataset.

Things you'll be looking for:

* What does the dataset look like?
* What are the data types?
* Which columns contain the features of the websites?
* Which column contains the feature we will predict? What is the code standing for benign vs malicious websites?
* Do we need to transform any of the columns from categorical to ordinal values? If so, which columns?

Feel free to add additional cells for your explorations. Make sure to comment on what you find out.

In [None]:
# Your code here
display(websites.head(),websites.dtypes,websites.shape)

### Your comment here:

The dataset contains 1781 records and 21 attributes, 7 of which are of type object.

#### Next, evaluate if the columns in this dataset are strongly correlated.

In the Mushroom supervised learning lab we did recently, we mentioned that strongly correlated columns are a concern because, when they are present, we need to choose certain ML algorithms over others. We need to evaluate this for our dataset now.

Luckily, most of the columns in this dataset are numeric, which makes things a lot easier for us. In the cells below, evaluate the level of collinearity of the data.

We provide some general directions for you to consult in order to complete this step:

1. You will create a correlation matrix using the numeric columns in the dataset.

1. Create a heatmap using `seaborn` to visualize which columns have high collinearity.

1. Comment on which columns you might need to remove due to high collinearity.
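
As a complement to the heatmap, the strongly correlated pairs can also be listed numerically, which may make the matrix easier to read. The following is only a minimal sketch on synthetic data: the helper `high_corr_pairs`, the toy columns `a`, `b`, `c`, and the 0.9 threshold default are assumptions for illustration, not part of the lab dataset.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_1, col_2, correlation) for pairs with |correlation| above threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() > threshold]
    return [(c1, c2, float(r)) for (c1, c2), r in strong.items()]

# Synthetic data: `b` is almost a copy of `a`, `c` is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data you would pass the numeric part of `websites` instead of `toy`; each reported pair is a candidate for dropping one of its two columns.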

In [None]:
# Your code here
object_columns = ['URL','CHARSET','SERVER','WHOIS_COUNTRY','WHOIS_STATEPRO','WHOIS_REGDATE','WHOIS_UPDATED_DATE']
websites_numeric1 = websites.drop(object_columns,axis=1)

import seaborn as sns
from matplotlib import pyplot
a4_dims = (11.7, 8.27)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.set()
ax = sns.heatmap(websites_numeric1.corr())

In [None]:
websites_numeric2 = websites_numeric1.drop('REMOTE_APP_PACKETS',axis=1)

import seaborn as sns
from matplotlib import pyplot
a4_dims = (11.7, 8.27)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.set()
ax = sns.heatmap(websites_numeric2.corr())

In [None]:
websites_numeric3 = websites_numeric2.drop('TCP_CONVERSATION_EXCHANGE',axis=1)

import seaborn as sns
from matplotlib import pyplot
a4_dims = (11.7, 8.27)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.set()
ax = sns.heatmap(websites_numeric3.corr())

### Your comment here

From the correlation matrix, the columns with the highest correlation, which could therefore be removed because they are suspected of adding little discriminating power, are:

REMOTE_APP_PACKETS

SOURCE_APP_PACKETS

TCP_CONVERSATION_EXCHANGE

NOTE: I have doubts about how the correlation matrix should be interpreted.

# Challenge 2 - Remove Column Collinearity.

From the heatmap you created, you should have seen at least 3 columns that can be removed due to high collinearity. Remove these columns from the dataset.

Note that you should remove as few columns as you can. You don't have to remove all the columns at once. Instead, try removing one column, then produce the heatmap again to determine whether additional columns should be removed. You can stop as soon as the dataset no longer contains any pair of columns with a correlation above 0.9. Also, keep in mind that when two columns have high collinearity, you only need to remove one of them, not both.

In the cells below, remove as few columns as you can to eliminate the high collinearity in the dataset. Make sure to comment as you go so that the instructional team can follow your thinking process and give feedback. At the end, print the heatmap again.
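
The remove-one-column-and-recheck procedure can also be sketched as a loop. This is a hedged illustration, not the lab's reference solution: the helper `drop_collinear` is hypothetical, it picks the column involved in the strongest remaining correlation (one possible rule among several), and it runs here on synthetic columns `a` to `d`.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.9):
    """Drop one column at a time until no pair has |correlation| above threshold."""
    df = df.copy()
    dropped = []
    while True:
        corr = df.corr().abs().fillna(0.0)
        # Ignore the diagonal (every column correlates perfectly with itself)
        corr = corr.mask(np.eye(len(corr), dtype=bool), 0.0)
        if corr.to_numpy().max() <= threshold:
            return df, dropped
        # Drop the column involved in the strongest remaining correlation
        worst = corr.max().idxmax()
        df = df.drop(columns=worst)
        dropped.append(worst)

# Synthetic data: `b` and `c` are near copies of `a`, `d` is independent
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=300),
    'c': a + rng.normal(scale=0.01, size=300),
    'd': rng.normal(size=300),
})
reduced, dropped = drop_collinear(toy)
print("dropped:", dropped, "kept:", list(reduced.columns))
```

Only one column of the correlated group survives, which matches the advice to remove one column of a collinear pair rather than both.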

In [None]:
# Your code here
websites_numeric4 = websites_numeric3.drop('SOURCE_APP_PACKETS',axis=1)

### Your comment here

I did the analysis in the previous section. I removed columns one at a time wherever a pair of columns was correlated, but without knowing for sure whether this is really sound, since I do not know exactly what each attribute refers to.

In [None]:
# Print heatmap again
a4_dims = (11.7, 8.27)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.set()
ax = sns.heatmap(websites_numeric4.corr())

# Challenge 3 - Handle Missing Values

The next step would be handling missing values. **We start by examining the number of missing values in each column, which you will do in the next cell.**

In [None]:
# Your code here
websites.isna().sum()

As you may remember from the previous labs, we drop a column if it contains a high proportion of missing values. After dropping those problematic columns, we drop the remaining rows with missing values.

#### In the cells below, handle the missing values from the dataset. Remember to comment on the rationale for your decisions.
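
The two-step rule above can be sketched on a small toy frame. Everything here is illustrative: the column names and the 0.4 cut-off (`max_missing`) are assumptions, and the right cut-off depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `mostly_missing` is 60% empty, `x` has one gap
toy = pd.DataFrame({
    'x': [1.0, 2.0, np.nan, 4.0, 5.0],
    'y': [10, 20, 30, 40, 50],
    'mostly_missing': [np.nan, np.nan, 3.0, np.nan, 5.0],
})

max_missing = 0.4  # assumed cut-off, choose it per dataset
missing_share = toy.isna().mean()  # fraction of NaN per column
to_drop = missing_share[missing_share > max_missing].index

# Step 1: drop the sparse columns. Step 2: drop rows that still have gaps.
cleaned = toy.drop(columns=to_drop).dropna()
print(cleaned)
```

Dropping the sparse column first matters: calling `dropna()` on the original frame would discard every row that is missing only in that column.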

In [None]:
# Your code here
websites.drop('CONTENT_LENGTH',axis=1,inplace=True)
websites.dropna(inplace=True)

### Your comment here
The CONTENT_LENGTH column is dropped because it has a high proportion of null values (812 out of 1781). I also drop the row with a null value in the DNS_QUERY_TIMES column.

#### Again, examine the number of missing values in each column. 

If all cleaned, proceed. Otherwise, go back and do more cleaning.

In [None]:
# Examine missing values in each column
websites.isna().sum()

# Challenge 4 - Handle `WHOIS_*` Categorical Data

There are several categorical columns we need to handle. These columns are:

* `URL`
* `CHARSET`
* `SERVER`
* `WHOIS_COUNTRY`
* `WHOIS_STATEPRO`
* `WHOIS_REGDATE`
* `WHOIS_UPDATED_DATE`

How to handle string columns is always case by case. Let's start by working on `WHOIS_COUNTRY`. Your steps are:

1. List out the unique values of `WHOIS_COUNTRY`.
1. Consolidate the country values with consistent country codes. For example, the following values refer to the same country and should use consistent country code:
    * `CY` and `Cyprus`
    * `US` and `us`
    * `SE` and `se`
    * `GB`, `United Kingdom`, and `[u'GB'; u'UK']`

#### In the cells below, fix the country values as instructed above.

In [None]:
# Your code here
websites['WHOIS_COUNTRY'].unique()

In [None]:
COUNTRIES = {'Cyprus':'CY','us':'US','se':'SE','United Kingdom':'GB',"[u'GB'; u'UK']":'GB'}
websites.replace({"WHOIS_COUNTRY": COUNTRIES},inplace=True)
websites['WHOIS_COUNTRY'].unique()

Since we have fixed the country values, can we convert this column to ordinal now?

Not yet. If you reflect on how we handled categorical columns in the previous labs, you probably remember that we ended up dropping a lot of those columns because they had too many unique values. Too many unique values in a column is undesirable in machine learning because it makes prediction inaccurate. But there are workarounds under certain conditions. One of the fixable conditions is:

#### If a limited number of values account for the majority of data, we can retain these top values and re-label all other rare values.

The `WHOIS_COUNTRY` column happens to be such a case. You can verify it by plotting a bar chart of the `value_counts` in the next cell:

In [None]:
# Your code here
websites['WHOIS_COUNTRY'].value_counts().plot(kind='bar',figsize=(15,15))

#### After verifying, let's keep the top 10 values of the column and re-label all other values with `OTHER`.

In [None]:
websites['WHOIS_COUNTRY'].value_counts()

In [None]:
# Your code here
# Keep the 10 most frequent country codes and re-label all other values as 'OTHER'
top_countries = set(websites['WHOIS_COUNTRY'].value_counts().head(10).index)

def country(x, countries=top_countries):
    return x if x in countries else 'OTHER'

websites['WHOIS_COUNTRY'] = websites['WHOIS_COUNTRY'].apply(country)
websites.head()

In [None]:
websites['WHOIS_COUNTRY'].value_counts()

Now that `WHOIS_COUNTRY` has been re-labelled, we don't need `WHOIS_STATEPRO` any more because the values of the states or provinces may no longer be relevant. We'll drop this column.

In addition, we will also drop `WHOIS_REGDATE` and `WHOIS_UPDATED_DATE`. These are the registration and update dates of the website domains, which are not relevant to our analysis.

#### In the next cell, drop `['WHOIS_STATEPRO', 'WHOIS_REGDATE', 'WHOIS_UPDATED_DATE']`.

In [None]:
# Your code here
websites.drop(['WHOIS_STATEPRO', 'WHOIS_REGDATE', 'WHOIS_UPDATED_DATE'],axis=1,inplace=True)

# Challenge 5 - Handle Remaining Categorical Data & Convert to Ordinal

Now print the `dtypes` of the data again. Besides `WHOIS_COUNTRY` which we already fixed, there should be 3 categorical columns left: `URL`, `CHARSET`, and `SERVER`.

In [None]:
# Your code here
websites.dtypes

#### `URL` is easy. We'll simply drop it because it has so many unique values that there's no way for us to consolidate them.

In [None]:
# Your code here
websites.drop('URL',axis=1,inplace=True)

#### Print the unique value counts of `CHARSET`. You see there are only a few unique values. So we can keep it as it is.

In [None]:
# Your code here
websites['CHARSET'].unique()

`SERVER` is a little more complicated. Print its unique values and think about how you can consolidate those values.

#### Before you think of your own solution, don't read the instructions that come next.

In [None]:
# Your code here
websites['SERVER'].unique()

![Think Hard](../think-hard.jpg)

### Your comment here
This is a category with many unique values, but they can be clustered into a small number of groups.

Although there are so many unique values in the `SERVER` column, there are actually only 3 main server types: `Microsoft`, `Apache`, and `nginx`. Just check if each `SERVER` value contains any of those server types and re-label them. For `SERVER` values that don't contain any of those substrings, label with `Other`.

At the end, your `SERVER` column should only contain 4 unique values: `Microsoft`, `Apache`, `nginx`, and `Other`.

In [None]:
# Your code here

def check_str(string):
    # Return the first main server type found in the value, or 'Other' if none matches
    for server_type in ('Microsoft', 'Apache', 'nginx'):
        if server_type in string:
            return server_type
    return 'Other'

websites['SERVER'] = websites['SERVER'].apply(check_str)

In [None]:
# Count `SERVER` value counts here
websites['SERVER'].value_counts()

OK, all our categorical data are fixed now. **Let's convert them to numeric dummy (indicator) columns using Pandas' `get_dummies` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)).** `get_dummies` replaces the listed categorical columns automatically; pass `drop_first=True` so that the first level of each category is dropped, because it is implied when all the other dummy columns are 0. **Also, assign the data with dummy values to a new variable `website_dummy`.**
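
To see what `drop_first=True` does, here is a small sketch on a toy frame. The frame `toy` is made up for illustration; only the column names `SERVER` and `Type` are borrowed from the lab.

```python
import pandas as pd

# Toy frame with one categorical column and the target
toy = pd.DataFrame({
    'SERVER': ['Apache', 'nginx', 'Microsoft', 'Other', 'Apache'],
    'Type': [0, 1, 0, 0, 1],
})

# One indicator column per category level, minus the first level
dummies = pd.get_dummies(toy, columns=['SERVER'], drop_first=True)
print(dummies.columns.tolist())
```

The original `SERVER` column is gone, and the alphabetically first level (`Apache`) gets no column of its own: a row where all `SERVER_*` indicators are 0 is an `Apache` row.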

In [None]:
# Your code here
website_dummy = pd.get_dummies(websites, columns=['WHOIS_COUNTRY', 'CHARSET', 'SERVER'], drop_first=True)

Now, inspect `website_dummy` to make sure the data and types are as intended. There shouldn't be any categorical columns at this point.

In [None]:
# Your code here
display(website_dummy.dtypes)
website_dummy.shape

# Challenge 6 - Modeling, Prediction, and Evaluation

We'll start off this section by splitting the data into training and test sets. **Name your 4 variables `X_train`, `X_test`, `y_train`, and `y_test`. Select 80% of the data for training and 20% for testing.**
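
A minimal sketch of an 80/20 split with `train_test_split`, shown on stand-in arrays rather than the lab data. The arrays `X_toy` and `y_toy`, the seed, and the use of `stratify` are assumptions for illustration; stratifying is optional but keeps the benign/malicious ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature matrix and the target column
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # assumed seed, only for reproducibility
    stratify=y_toy,     # keep the class ratio equal in both splits
)
print(X_train.shape, X_test.shape)
```

With the lab data, the features would be every column of `website_dummy` except `Type`, and the target would be `website_dummy['Type']`.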

In [None]:
website_dummy.head()

In [None]:
website_dummy.describe()

In [None]:
from sklearn.model_selection import train_test_split

# Your code here:


#### For this lab, we will opt to use SVM. 

Support Vector Machines (SVM) is an algorithm that looks for the line or plane (in general, a hyperplane) that separates the two groups while maximizing the distance from the closest observations of each group to that boundary. You can read more about this algorithm [here](https://en.wikipedia.org/wiki/Support_vector_machine).

In the next cell, `svm` will be imported for you. **You will initialize the proper estimator, fit the training data, and predict the test data.**

The `sklearn.svm` module documentation can be found [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm). By reading the documentation and searching online, the question you'll need to answer is: **which SVM estimator should you use?** When you choose the estimator, keep the following in mind:

* Our data are categorical, not continuous.

* We have removed the correlated columns. All columns we have right now are independent.

If your statistical knowledge is not adequate at this moment, don't worry. Just play around and make an informed guess. We'll evaluate your prediction in the next step. If the prediction is unsatisfactory you can move back to this step to modify your estimator.
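
The initialize, fit, predict cycle looks the same for every estimator in `sklearn.svm`. The sketch below uses `svm.SVC` with an RBF kernel purely as one candidate, on synthetic data from `make_classification`; it is not a claim about which estimator is the right answer for this lab, and all variable names are illustrative.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the website features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, class_sep=2.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

clf = svm.SVC(kernel='rbf')   # one candidate; LinearSVC and NuSVC are alternatives
clf.fit(X_tr, y_tr)           # learn the separating boundary on the training split
y_pred_toy = clf.predict(X_te)
toy_accuracy = accuracy_score(y_te, y_pred_toy)
print("Accuracy:", toy_accuracy)
```

Swapping in another estimator only changes the line that creates `clf`, which makes it cheap to try several and compare their scores.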

In [None]:
from sklearn import svm

# Your code here:


In the following cell, we'll show you how to compute the accuracy of your prediction. The output score will show you how often your classifier is correct. If you have used the proper estimator, your accuracy score should be over 0.9. However, if your accuracy score is unsatisfactory, go back to the previous step to try another estimator until you produce a satisfactory accuracy score.

In [None]:
# Import libraries
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Create the datasets used to evaluate the models
cols = [col for col in website_dummy.columns.values if col != "Type"]
X = website_dummy[cols]
y = website_dummy["Type"]

In [None]:
# Logistic regression function with score (cross-validation)
def LRs(X,y,ns):
    cls = LogisticRegression(solver='liblinear')
    #cls.fit(X,y)
    scores = cross_val_score(cls,X,y,cv=ns)
    #print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    return scores.mean()


In [None]:
#LRs(X,y,5)

In [None]:
# SVM function with score (cross-validation)
def SVMs(X,y,k,ns):
    cls = svm.SVC(kernel=k,probability=True)
    #cls.fit(X,y)
    scores = cross_val_score(cls,X,y,cv=ns)
    #print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    return scores.mean()


In [None]:
#SVMs(X,y,'linear',5)

In [None]:
cls = svm.SVC(kernel='linear',probability=True)
scores = cross_val_score(cls,X,y,cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [None]:
# Random forest function with score (cross-validation)
def RFs(X,y,n,ns):
    cls = RandomForestClassifier(n_estimators=n)
    #cls.fit(X,y)
    scores = cross_val_score(cls,X,y,cv=ns)
    #print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    return scores.mean()

In [None]:
#RFs(X,y,20,5)

In [None]:
# 'precomputed' is left out: it expects a square kernel matrix, not a feature matrix
svm_k = ['linear','poly','rbf','sigmoid']
rf_e = [5,10,20]

In [None]:
def best_classifier(X,y,svm_k,rf_e,ns=5):
    # Collect (model name, mean cross-validated accuracy) pairs
    classifiers = [('Logistic Regression', LRs(X,y,ns))]
    for k in svm_k:
        classifiers.append(('Support Vector Machines', SVMs(X,y,k,ns)))
    for n in rf_e:
        classifiers.append(('Random Forest', RFs(X,y,n,ns)))
    # Pick the pair with the highest score
    name, best = max(classifiers, key=lambda pair: pair[1])
    print("Best classifier is a {} model with an accuracy of {}".format(name,best))

In [None]:
# Logistic regression function (scores on the same data it was fitted on)
def LR(X,y):
    cls = LogisticRegression(solver='liblinear')
    cls.fit(X,y)
    print(cls.score(X,y))

In [None]:
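# Compute prediction accuracy on a held-out 20% test split

from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
y_pred = svm.SVC(kernel='linear').fit(X_train, y_train).predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))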

# Bonus Challenge - Feature Scaling

Problem-solving in machine learning is iterative. You can improve your model's predictions with various techniques (though there is a sweet spot between the time you spend and the improvement you get). So far you have completed only one iteration of ML analysis. There are more iterations you can conduct to make improvements. To do that, you will need deeper knowledge of statistics and command of more data analysis techniques. In this bootcamp, we don't have time to reach that advanced goal, but with steady effort after the bootcamp you will eventually get there.

However, we do want you to learn one more advanced technique now: *feature scaling*. The idea of feature scaling is to standardize or normalize the range of the independent variables (features) of the data. This can make outliers more apparent so that you can remove them. Apply it during Challenge 6, after you split the training and test data: reusing the same split is what makes your results with and without feature scaling comparable. For general concepts about feature scaling, click [here](https://en.wikipedia.org/wiki/Feature_scaling). To read deeper, click [here](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e).

In the next cell, attempt to improve your model's prediction accuracy by means of feature scaling. A class you can use is `sklearn.preprocessing.RobustScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)). Use the `RobustScaler` to fit and transform your `X_train`, then transform `X_test` with the fitted scaler. Then use SVM to fit and predict on the transformed data and obtain the accuracy score in the same way. Compare the accuracy score on the scaled data with the previous accuracy score. Is there an improvement?
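
The fit-on-train, transform-on-test pattern can be sketched as follows. This is an illustration on synthetic data, not the lab solution: the data come from `make_classification`, one feature is deliberately blown up in scale, and all variable names are made up.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_toy[:, 0] *= 1000.0   # give one feature a much larger scale than the others
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scaler = RobustScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn median and IQR on the training data only
X_te_scaled = scaler.transform(X_te)       # reuse those statistics on the test data

# Same estimator, same split, with and without scaling
raw_score = svm.SVC().fit(X_tr, y_tr).score(X_te, y_te)
scaled_score = svm.SVC().fit(X_tr_scaled, y_tr).score(X_te_scaled, y_te)
print("unscaled:", raw_score, "scaled:", scaled_score)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the model, so the two scores remain a fair comparison.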

In [None]:
# Your code here