## Objective

Develop a brief understanding of the dataset you will be working with: load the dataset and perform basic data inspection (how many features it contains, how many unique labels it has, and so on).

## Workflow

1. Fix the seed of numpy's random number generator as soon as you import it, so that your project is reproducible. If you are using separate notebooks, make sure you do this in those notebooks as well.

2. Look at the data and understand its features. The dataset is in .csv format, which can be loaded using the pandas library; the features (the columns of the dataset) can also be extracted using pandas itself. If you are looking for a quick introduction to pandas, I have provided a great resource for that in section 1.2.

3. Take a look at the first ten rows of the dataset. The dataset has a large number of features, so you might want to modify the output a bit to make it fit the screen.

4. Find out the exact dimensions of the dataset.

5. Find out how many unique labels there are in the dataset. The labels of the dataset are given in the Result column.

In [28]:
"""
1. Fix the seed of numpy's random number generator as soon as you import it, so that your 
project is reproducible. If you are using separate notebooks, make sure you do this
in those notebooks as well.
"""

import numpy as np
np.random.seed(7483)
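
As a quick check that seeding behaves as intended, the sketch below re-seeds the global generator with the same value and confirms that the draws repeat. The second half uses `np.random.default_rng`, a local generator that is not used in this notebook and is shown only as an assumed alternative; the variable names are illustrative.

```python
import numpy as np

# Legacy global seeding, as used in the cell above.
np.random.seed(7483)
first_draw = np.random.rand(3)

# Re-seeding with the same value restarts the same sequence.
np.random.seed(7483)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))

# Assumed alternative (not used in this notebook): a local Generator
# that does not depend on global state.
rng_a = np.random.default_rng(7483)
rng_b = np.random.default_rng(7483)
same_local = np.array_equal(rng_a.random(3), rng_b.random(3))
print(same_local)
```

Note that the global seed only affects code that draws from `np.random` directly, which is why each notebook needs its own call.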

In [29]:
"""
2. Look at the data and understand its features. The dataset is in .csv format, 
which can be loaded using the pandas library; the features (the columns of the dataset) can 
also be extracted using pandas itself. If you are looking for a quick introduction to pandas, I 
have provided a great resource for that in section 1.2.
"""

import pandas as pd


dataset_file_path = "../datasets/Phishing.csv"

phishing_df = pd.read_csv(dataset_file_path)
phishing_df.head()  # Shows the first 5 rows by default, with the features laid out left-to-right as columns

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1


In [30]:
""" (continued)
2. Look at the data and understand its features. The dataset is in .csv format, 
which can be loaded using the pandas library; the features (the columns of the dataset) can 
also be extracted using pandas itself. If you are looking for a quick introduction to pandas, I 
have provided a great resource for that in section 1.2.
"""

# To extract the columns/features we can use the .keys() function available to the DataFrame class
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.keys.html
features = phishing_df.keys()  # Equivalent to phishing_df.columns
print(features)

Index(['having_IP_Address', 'URL_Length', 'Shortining_Service',
       'having_At_Symbol', 'double_slash_redirecting', 'Prefix_Suffix',
       'having_Sub_Domain', 'SSLfinal_State', 'Domain_registeration_length',
       'Favicon', 'port', 'HTTPS_token', 'Request_URL', 'URL_of_Anchor',
       'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL',
       'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe',
       'age_of_domain', 'DNSRecord', 'web_traffic', 'Page_Rank',
       'Google_Index', 'Links_pointing_to_page', 'Statistical_report',
       'Result'],
      dtype='object')
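
For readers without the dataset at hand, the load-and-inspect step can be tried on a tiny in-memory CSV. This is only a sketch: the three-column table below is made up for illustration and merely borrows a few column names from the real file.

```python
import io

import pandas as pd

# Hypothetical miniature CSV (not the real Phishing.csv).
csv_text = """having_IP_Address,URL_Length,Result
-1,1,-1
1,1,-1
1,0,1
"""

# read_csv accepts any file-like object, so StringIO stands in for a path.
toy_df = pd.read_csv(io.StringIO(csv_text))

# For a DataFrame, .keys() and .columns return the same Index of column names.
feature_names = toy_df.columns.tolist()
print(feature_names)
print(toy_df.keys().equals(toy_df.columns))
```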


In [31]:
"""
3. Take a look at the first ten rows of the dataset. The dataset has a large number of features, 
so you might want to modify the output a bit to make it fit the screen.
"""

# Transpose so that features become rows and samples become columns, then keep only the
# first 10 columns, which correspond to the first ten rows of the original dataset.
phishing_df.transpose().iloc[:,:10]

# To also limit the transposed output to its first ten rows (the first ten features), we could use:
# phishing_df.transpose().head(10) # (this does not limit the columns)
# or 
# phishing_df.transpose().iloc[:10, :10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
having_IP_Address,-1,1,1,1,1,-1,1,1,1,1
URL_Length,1,1,0,0,0,0,0,0,0,1
Shortining_Service,1,1,1,1,-1,-1,-1,1,-1,-1
having_At_Symbol,1,1,1,1,1,1,1,1,1,1
double_slash_redirecting,-1,1,1,1,1,-1,1,1,1,1
Prefix_Suffix,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
having_Sub_Domain,-1,0,-1,-1,1,1,-1,-1,1,-1
SSLfinal_State,-1,1,-1,-1,1,1,-1,-1,1,1
Domain_registeration_length,-1,-1,-1,1,-1,-1,1,1,-1,-1
Favicon,1,1,1,1,1,1,1,1,1,1
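
Transposing is one way to make a wide frame fit the screen. Another option, sketched below on a made-up 40-column frame, is to temporarily raise pandas' display limits with `pd.option_context`. The frame, its column names, and the width of 1000 characters are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 12 rows, 40 columns of zeros.
column_names = [f"feature_{i}" for i in range(40)]
wide_df = pd.DataFrame(np.zeros((12, 40), dtype=int), columns=column_names)

# Inside the context, no columns are elided from the text representation;
# the previous display settings are restored on exit.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    rendered = repr(wide_df.head(10))
print(rendered)

# The transposed view of the same ten rows has one row per feature.
transposed = wide_df.head(10).transpose()
print(transposed.shape)
```

The context manager keeps the change local to one cell, whereas `pd.set_option` would apply it for the rest of the session.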


In [32]:
"""
4. Find out the exact dimensions of the dataset.
"""

# There are multiple ways to interpret this requirement, so I'm providing multiple answers.

print("The Phishing dataset contains:")
print(f"Total Elements: {phishing_df.size}")
print(f"Rows: {phishing_df.shape[0]} / Columns: {phishing_df.shape[1]}")
print(f"Number of Dimensions: {phishing_df.ndim}")

The Phishing dataset contains:
Total Elements: 342705
Rows: 11055 / Columns: 31
Number of Dimensions: 2
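
The three quantities above are related: for a two-dimensional frame, `size` is the product of the two entries of `shape` (here 11055 × 31 = 342705). A minimal sketch on a made-up 4 × 3 frame:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame used only to illustrate the relationship.
toy_df = pd.DataFrame(np.ones((4, 3)))

n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)
print(toy_df.size == n_rows * n_cols)
print(toy_df.ndim)
```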


In [33]:
"""
5. Find out how many unique labels there are in the dataset. 
The labels of the dataset are given in the Result column.
"""

phishing_df['Result'].unique()

array([-1,  1])
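
`unique()` lists the distinct labels; to get the count directly, `nunique()` can be used, and `value_counts()` additionally shows how many rows carry each label. The sketch below runs on a small made-up label column rather than the real `Result` data.

```python
import pandas as pd

# Hypothetical label column with the same two values as Result.
labels = pd.Series([-1, 1, 1, -1, 1], name="Result")

n_labels = labels.nunique()     # number of distinct labels
counts = labels.value_counts()  # rows per label

print(n_labels)
print(counts)
```

On the real data, the equivalent calls would be `phishing_df['Result'].nunique()` and `phishing_df['Result'].value_counts()`.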