# IF3070 Foundations of Artificial Intelligence | Tugas Kecil 2

Group Number: 64

Group Members:
- Nathaniel Liady (18222114)
- Gabriel Marcellino (18222115)

## Import Libraries

In [None]:
import pandas as pd
import numpy as np

# Import other libraries if needed
import seaborn as sns
import matplotlib.pyplot as plt


# Additional settings
pd.set_option('display.max_columns',None)

## Import Dataset

In [None]:
# Example of reading a csv file from a gdrive link

# Take the file id from the gdrive file url
# https://drive.google.com/file/d/1ZUtiaty9RPXhpz5F2Sy3dFPHF4YIt5iU/view?usp=sharing => The file id is 1ZUtiaty9RPXhpz5F2Sy3dFPHF4YIt5iU
# and then put it in this format:
# https://drive.google.com/uc?id={file_id}
# Don't forget to change the access to public

# # Example
# df = pd.read_csv('https://drive.google.com/uc?id=1ZUtiaty9RPXhpz5F2Sy3dFPHF4YIt5iU')
# df.head()

df = pd.read_csv('https://drive.google.com/uc?id=15pnRBoG8nJRxJx3Bp8tOneZEB1XmHCYe')
# df = pd.read_csv('train.csv')


In [None]:
df.head(10)

In [None]:
df_copy = df.copy()

# Additional Step

## 1. Change Format Type

Some features are boolean by nature but are stored as floats. Here we convert them to the intended data type.

In [None]:
# See all columns
df_copy.columns

In [None]:
bool_columns = [
    'IsDomainIP','HasObfuscation','IsHTTPS','HasTitle','HasFavicon','IsResponsive','HasDescription','Robots','HasHiddenFields','HasPasswordField',
    'HasExternalFormSubmit','HasSocialNet','HasSubmitButton','HasCopyrightInfo','Crypto','Pay','Bank'
    ]

df_copy[bool_columns]

In [None]:
# Before change
df_copy[bool_columns].dtypes

In [None]:
# Change the values to boolean but keep the missing values
def change_bool(value):
    if pd.isna(value):
        return pd.NA
    return bool(value)

for column in bool_columns:
    # The nullable 'boolean' dtype keeps missing values as <NA>;
    # a plain 'bool' cast would silently turn NaN into True
    df_copy[column] = df_copy[column].apply(change_bool).astype('boolean')

In [None]:
# After Change
df_copy.dtypes

In [None]:
df = df_copy.copy()
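
The choice of dtype matters for the conversion above, because NumPy's plain `bool` cannot represent missing data while pandas' nullable `boolean` can. A minimal sketch on a made-up column (the `toy` series is hypothetical and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical float-encoded flag column with one missing value
toy = pd.Series([1.0, 0.0, np.nan])

# Plain NumPy bool: NaN is truthy, so the missing entry becomes True
as_numpy_bool = toy.astype('bool')

# Nullable 'boolean' dtype: the missing entry is kept as <NA>
as_nullable_bool = toy.astype('boolean')

print(as_numpy_bool.tolist())
print(as_nullable_bool.isna().sum())
```

With the nullable dtype, the missing-value percentages computed later still count these entries as missing.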

# 1. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to uncover patterns, trends, anomalies, and insights. It is the first step before applying more advanced statistical and machine learning techniques. EDA helps you to gain a deep understanding of the data you are working with, allowing you to make informed decisions and formulate hypotheses for further analysis.

## A. Data Understanding
The objective of this section is for participants to understand the quality of the provided data. This includes:

1. Data Size
2. Statistics of Each Feature
3. Outliers
4. Correlation
5. Distribution

### Step 1

Find the following:

1. The size of the data (instances and features).
2. The data types of each feature.
3. The number of unique values for categorical features.
4. The minimum, maximum, mean, median, and standard deviation values for non-categorical features.
5. Explain the significance of gathering the first four pieces of information.

#### 1. Data Size

In [None]:
# data size
print(f"data row: {df.shape[0]} rows")
print(f"data column: {df.shape[1]} columns")

In [None]:
# columns
df.columns

#### 2. Data Types

In [None]:
# data types of each feature
df.dtypes

#### 3. Unique Values

In [None]:
# the number of unique values

df.nunique().sort_values(ascending=True)

#### 4. Statistical Information

In [None]:
# Statistical information

# Select the numerical features
num_categories = df.select_dtypes(include='number')

In [None]:
# Numerical Columns
num_categories.columns

In [None]:
num_categories.describe()

#### 5. Purpose of the four steps

- Data Size and Data Types: <br>
    Knowing the data size tells us how much data we are working with, which helps us choose sensible proportions when splitting the data. Knowing the data type of each feature tells us which analyses apply to it (e.g. summary statistics only make sense for numerical features).
 <br>

- Unique Values: <br>
  The number of unique values in each feature helps us identify whether the feature is continuous or categorical.
<br>

- Statistical Information: <br>
  The mean, standard deviation (how far the values spread from the mean), quartiles (the values below which 25%, 50%, and 75% of the data fall), minimum, and maximum summarize each feature. With this information we can detect possible outliers or anomalies in the data (for example, the maximum of URLLength lies beyond its 75% quartile, which is 30).

<br>
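
The idea of separating categorical from continuous features by their number of unique values can be sketched as follows. The toy frame and the `MAX_CATEGORIES` cutoff are assumptions made for illustration; a sensible cutoff for the real dataset would have to be chosen after looking at `df.nunique()`.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself would use `df`
toy = pd.DataFrame({
    'IsHTTPS': [1, 0, 1, 1, 0, 1],
    'URLLength': [23, 31, 27, 150, 26, 45],
    'TLD': ['com', 'org', 'com', 'xyz', 'com', 'net'],
})

# Assumption: at most this many distinct values counts as "categorical-like"
MAX_CATEGORIES = 4

unique_counts = toy.nunique()
categorical_like = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
continuous_like = unique_counts[unique_counts > MAX_CATEGORIES].index.tolist()

print("categorical-like:", categorical_like)
print("continuous-like:", continuous_like)
```

This is only a heuristic: a count feature with few distinct values would also land in the categorical-like group, so the result still needs a manual check.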

### Step 2

Find the following:

1. Missing values for each feature.
2. Outliers for each feature (use the methods you are familiar with).
3. Why is it necessary to identify missing values and outliers?

#### 1. Missing Values

Percentage of missing values in each feature, relative to the total number of rows.

In [None]:
missing_percentage = ((df.isna().sum() * 100 / len(df)).round(4))
df_missing_percentage = missing_percentage.to_frame(name='Missing Values (%)').sort_values(by='Missing Values (%)',ascending=True)

print(df_missing_percentage)

#### 2. Outliers

In [None]:
# Outliers plot without id and label
num_categories_exc = num_categories.select_dtypes(exclude='int64')
fig,axs = plt.subplots(ncols=8,nrows=4,figsize=(32,16))
axs = axs.flatten()

for i, col in enumerate(num_categories_exc.columns):
    sns.boxplot(data=df[col], ax=axs[i])
    axs[i].set_title(col)
    axs[i].set_xlabel('')
    axs[i].set_ylabel('')
plt.tight_layout()
plt.show()
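
The boxplots show outliers visually. To put a number on them, one common option is the 1.5 × IQR rule, which is the same rule seaborn's boxplot whiskers use by default. The sketch below is one possible approach on made-up data, not necessarily the method used in this notebook; `count_iqr_outliers` and the toy frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself would pass its numerical features
toy = pd.DataFrame({
    'URLLength': [20, 22, 25, 27, 30, 31, 33, 400],
    'NoOfSubDomain': [1, 1, 2, 2, 1, 2, 1, 2],
})

def count_iqr_outliers(frame, k=1.5):
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare every column against its own bounds
    mask = frame.lt(lower, axis='columns') | frame.gt(upper, axis='columns')
    return mask.sum()

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Missing values compare as `False` on both sides, so they are never counted as outliers here.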

#### 3. Why we need to find missing values and outliers

- Missing Values: <br>
  Missing values can lead to poor model predictions. Knowing how many values are missing in each feature helps us decide whether to keep the feature and fill in its missing values (feature imputation) or to remove it from the dataset.
<br>

- Outliers: <br>
  An outlier is an anomalous value that lies far outside the typical range of a feature, which can distort how the model reads the data. As with missing values, we can handle outliers in a later step.
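
The keep-or-remove decision for features with missing values can be sketched as below. The 50% `DROP_THRESHOLD` and the use of the median as fill value are assumptions chosen for illustration, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values
toy = pd.DataFrame({
    'URLLength': [20.0, np.nan, 25.0, 30.0, 31.0],
    'NoOfPopup': [np.nan, np.nan, np.nan, np.nan, 0.0],
})

# Assumption: a feature with more than 50% missing values is removed
DROP_THRESHOLD = 0.5

missing_ratio = toy.isna().mean()
to_drop = missing_ratio[missing_ratio > DROP_THRESHOLD].index
cleaned = toy.drop(columns=to_drop)

# Remaining numerical features: fill the gaps with the column median
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))

print(cleaned)
```

The median is used here because it is less sensitive to outliers than the mean; other imputation strategies are equally possible.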

### Step 3

Find the following:

1. Correlations between features.
2. Visualize the distribution of each feature (categorical and continuous).
3. Visualize the correlation between features and the target variable.
4. Explain the significance of understanding feature distributions and correlations.

#### 1. Correlations

In [None]:
corr = num_categories.corr(method='pearson')
plt.figure(figsize=(20,16))
sns.heatmap(corr,annot=True,fmt=".2f",linewidths=.5,cmap='YlOrBr')
plt.title('Correlation Matrix',weight='bold')
plt.show()
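
A full heatmap can be hard to read when there are many features. One way to focus on the target is to take only the `label` column of the correlation matrix and rank the features by absolute correlation. A minimal sketch on made-up data (the values and the label coding in `toy` are arbitrary):

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself would use its numerical features
toy = pd.DataFrame({
    'URLLength': [20, 25, 30, 80, 95, 120],
    'NoOfImage': [12, 9, 15, 1, 0, 2],
    'TLDLength': [3, 2, 3, 3, 2, 3],
    'label': [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every feature with the label
corr_with_label = toy.corr(method='pearson')['label'].drop('label')

# Order the features from strongest to weakest absolute correlation
order = corr_with_label.abs().sort_values(ascending=False).index
ranked = corr_with_label.reindex(order)

print(ranked)
```

Because the label is binary, the Pearson coefficient here is equivalent to the point-biserial correlation.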

#### 2. Distribution
Note: we plot only a random half of the data rather than all of it, <br>
to reduce the computational work.

In [None]:
# Plot only a 50% sample
# to reduce the computational work
df_plot = df.sample(n=int(len(df)/2),random_state=42)

In [None]:
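# Function to plot distributions

def plot_distribution(data,label='label',type='categorical'):
    '''
    Plot the distribution of every feature of the given type, split by label

    data    : dataframe that contains the features and the label column
    label   : label column
    type    : 'categorical' (object/bool columns) or 'continuous' (numeric columns)
    '''
    if type == 'categorical':
        features = data.select_dtypes(include=['object','bool']).columns
    elif type == 'continuous':
        features = data.select_dtypes(include=['number']).columns
    else:
        raise ValueError("type must be 'categorical' or 'continuous'")

    # The label is only used as hue, so it is not plotted as a feature
    features = [col for col in features if col != label]

    n_cols = 4
    n_rows = int(np.ceil(len(features)/n_cols))

    fig,axs = plt.subplots(n_rows,n_cols,figsize=(20,5 * n_rows))
    axs = axs.flatten()

    for i,col in enumerate(features):
        sns.histplot(data=data,x=col,hue = label,ax=axs[i])
        axs[i].set_title(f"Distribution of {col}")
        axs[i].set_ylabel("Frequency")

    # Hide the axes that are left unused
    for ax in axs[len(features):]:
        ax.set_visible(False)

    plt.tight_layout()
    plt.show()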



##### 2.1. Distribution (Categorical)

In [None]:
categorical_features = [
    'IsDomainIP', 'HasObfuscation', 'IsHTTPS', 'HasTitle', 'HasFavicon',
    'HasDescription', 'HasPasswordField', 'HasExternalFormSubmit',
    'Bank', 'Pay', 'Crypto', 'HasSocialNet', 'Robots',
    'IsResponsive', 'label'
]

In [None]:
plot_distribution(df_plot[categorical_features],type='categorical')

##### 2.2. Distribution (Continuous)

In [None]:
continuous_features = [
    'URLLength', 'DomainLength', 'CharContinuationRate', 'TLDLegitimateProb',
    'URLCharProb', 'TLDLength', 'NoOfSubDomain', 'NoOfObfuscatedChar',
    'ObfuscationRatio', 'NoOfLettersInURL', 'LetterRatioInURL', 'NoOfDegitsInURL',
    'DegitRatioInURL', 'NoOfEqualsInURL', 'NoOfQMarkInURL', 'NoOfAmpersandInURL',
    'NoOfOtherSpecialCharsInURL', 'SpacialCharRatioInURL', 'LineOfCode',
    'LargestLineLength', 'DomainTitleMatchScore', 'URLTitleMatchScore', 'NoOfPopup',
    'NoOfiFrame', 'NoOfImage', 'NoOfCSS', 'NoOfJS', 'NoOfSelfRef', 'NoOfEmptyRef',
    'NoOfExternalRef','label'
]

In [None]:
plot_distribution(df_plot[continuous_features],type='continuous')

#### 3. Visualization of features against the target

In [None]:
def plot_by_label(features,y='label'):
  '''
  Function to plot each feature against the target (label)

  features: dataframe that contains the feature columns and the label column
  y: label column
  '''
  cols = [col for col in features.columns if col != y]

  # One subplot per feature column (not per row of the dataframe)
  ncols = 3
  nrows = int(np.ceil(len(cols) / ncols))

  fig,axs = plt.subplots(ncols=ncols,nrows=nrows,figsize=(5*ncols,4*nrows))
  axs = np.atleast_1d(axs).flatten()

  for ax,col in zip(axs,cols):
    sns.scatterplot(data=features,x=col,y=y,ax=ax)
    ax.set_title(f"Feature {col} by {y}")

  # Hide the axes that are left unused
  for ax in axs[len(cols):]:
    ax.set_visible(False)

  plt.tight_layout()
  plt.show()


In [None]:
df.columns

In [None]:
# Features to plot

features_to_plot = [
        'URLLength',  'DomainLength',
       'IsDomainIP', 'TLD', 'CharContinuationRate', 'TLDLegitimateProb',
       'URLCharProb', 'TLDLength', 'NoOfSubDomain', 'HasObfuscation',
       'NoOfObfuscatedChar', 'ObfuscationRatio', 'NoOfLettersInURL',
       'LetterRatioInURL', 'NoOfDegitsInURL', 'DegitRatioInURL',
       'NoOfEqualsInURL', 'NoOfQMarkInURL', 'NoOfAmpersandInURL',
       'NoOfOtherSpecialCharsInURL', 'SpacialCharRatioInURL', 'IsHTTPS',
       'LineOfCode', 'LargestLineLength', 'HasTitle', 'Title',
       'DomainTitleMatchScore', 'URLTitleMatchScore', 'HasFavicon', 'Robots',
       'IsResponsive', 'NoOfURLRedirect', 'NoOfSelfRedirect', 'HasDescription',
       'NoOfPopup', 'NoOfiFrame', 'HasExternalFormSubmit', 'HasSocialNet',
       'HasSubmitButton', 'HasHiddenFields', 'HasPasswordField', 'Bank', 'Pay',
       'Crypto', 'HasCopyrightInfo', 'NoOfImage', 'NoOfCSS', 'NoOfJS',
       'NoOfSelfRef', 'NoOfEmptyRef', 'NoOfExternalRef', 'label'
       ]

In [None]:
# The label column must be part of the dataframe so it can be used as the y-axis
plot_by_label(df_plot[bool_columns + ['label']])

In [None]:
categorical_features

In [None]:
continuous_features

In [None]:
# plot_by_label(features= df_plot[features_to_plot])

#### 4. Explanation for feature distributions and correlations

## B. Data Insights

The objective of this section is for participants to understand how to formulate questions and draw insights from the given data so that we can improve the model performance. Example questions:

1. What is the proportion of phishing to non-phishing URLs on security-related features (`IsHTTPS` and `Robots`)?
2. Is there a significant correlation between the label of a URL (phishing or non-phishing) and its URL characteristics?
3. How do website-resource-related features vary across phishing and non-phishing URLs?

### Step 1

Answer the three example questions by visualizing and explaining the insights for each question. Add markdown texts to explain the visualizations.

#### 1. What is the proportion of phishing to non-phishing URLs on security-related features (`IsHTTPS` and `Robots`)?

In [None]:
# Write your code here
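
One possible starting point for this question is a row-normalized cross-tabulation, which gives the share of each label within every value of a security feature. The sketch below uses made-up values, so its numbers say nothing about the real dataset; the same call could be repeated for `Robots`.

```python
import pandas as pd

# Hypothetical toy data; the label coding here is arbitrary
toy = pd.DataFrame({
    'IsHTTPS': [1, 1, 0, 0, 1, 0, 1, 1],
    'label':   [1, 1, 0, 0, 1, 0, 0, 1],
})

# normalize='index' makes every row (one IsHTTPS value) sum to 1
proportions = pd.crosstab(toy['IsHTTPS'], toy['label'], normalize='index')

print(proportions)
```

The resulting table can be passed to `proportions.plot(kind='bar', stacked=True)` to visualize the shares.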

#### 2. Is there a significant correlation between the label of a URL (phishing or non-phishing) and its URL characteristics?

URL Characteristics:
- `URLLength`
- `Domain`
- `DomainLength`
- `IsDomainIP`
- `TLD`
- `TLDLength`
- `NoOfSubDomain`
- `HasObfuscation`
- `NoOfObfuscatedChar`
- `ObfuscationRatio`
- `NoOfLettersInURL`
- `LetterRatioInURL`
- `NoOfDegitsInURL`
- `DegitRatioInURL`
- `NoOfEqualsInURL`
- `NoOfQMarkInURL`
- `NoOfAmpersandInURL`
- `NoOfOtherSpecialCharsInURL`
- `SpacialCharRatioInURL`
- `CharContinuationRate`

In [None]:
# Write your code here

#### 3. How do website-resource-related features vary across phishing and non-phishing URLs?

Website resource related features:
- `NoOfImage`
- `NoOfCSS`
- `NoOfJS`
- `NoOfSelfRef`
- `NoOfEmptyRef`
- `NoOfExternalRef`

In [None]:
# Write your code here
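
A simple way to compare resource counts between the two classes is to group by `label` and summarize each feature. The sketch below uses the median, which is an assumption made because count features tend to be skewed; the toy values are made up and only three of the listed features are shown.

```python
import pandas as pd

# Subset of the website-resource features, for illustration only
resource_cols = ['NoOfImage', 'NoOfCSS', 'NoOfJS']

# Hypothetical toy data; the label coding here is arbitrary
toy = pd.DataFrame({
    'NoOfImage': [10, 14, 12, 0, 1, 2],
    'NoOfCSS':   [4, 6, 5, 0, 1, 0],
    'NoOfJS':    [8, 9, 7, 1, 0, 2],
    'label':     [1, 1, 1, 0, 0, 0],
})

# Median of every resource feature within each label group
summary = toy.groupby('label')[resource_cols].median()

print(summary)
```

On the real data, the same grouping can feed a boxplot per feature (for example `sns.boxplot(data=df, x='label', y='NoOfImage')`) to show the spread as well as the center.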

### Step 2

Try to formulate three other new questions and answer them with the methods used before.

#### 4. Your first question (replace this heading)

In [None]:
# Write your code here

#### 5. Your second question (replace this heading)

In [None]:
# Write your code here

#### 6. Your third question (replace this heading)

In [None]:
# Write your code here