# Web Scraping

## Beautiful Soup

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
# Use requests to connect to the web page (url)
url = 'https://en.wikipedia.org/wiki/List_of_airlines_of_the_United_States'
response = requests.get(url).text # convert content to text

In [3]:
# Parse the HTML text with Beautiful Soup
b_soup = BeautifulSoup(response, 'lxml')  # 'lxml' selects the lxml HTML parser

In [4]:
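# find_all returns a list of every <table> whose class includes "wikitable"
# (findAll is the deprecated spelling of the same method)
table = b_soup.find_all("table", {"class": "wikitable"})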

In [5]:
table

[<table class="wikitable sortable" style="border: 0; cellpadding: 2; cellspacing: 3;">
 <tbody><tr style="vertical-align:middle;">
 <th>Airline
 </th>
 <th>Image
 </th>
 <th><a class="mw-redirect" href="/wiki/IATA_airline_designator" title="IATA airline designator">IATA</a>
 </th>
 <th><a class="mw-redirect" href="/wiki/ICAO_airline_designator" title="ICAO airline designator">ICAO</a>
 </th>
 <th><a href="/wiki/Call_sign#Aviation" title="Call sign">Callsign</a>
 </th>
 <th>Primary hubs, <br/> <i>secondary hubs</i>
 </th>
 <th>Founded
 </th>
 <th class="unsortable">Notes
 </th></tr>
 <tr>
 <td><a href="/wiki/Alaska_Airlines" title="Alaska Airlines">Alaska Airlines</a>
 </td>
 <td><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:N615AS_Alaska_Airlines_2000_Boeing_737-790_C_N_30344_(28850996478).jpg"><img class="mw-file-element" data-file-height="1079" data-file-width="1851" decoding="async" height="58" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/N615AS_

In [15]:
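# Use pandas to parse the first HTML table into a DataFrame
from io import StringIO

# Wrap the HTML string in StringIO: passing literal HTML to read_html is deprecated in recent pandas
df = pd.read_html(StringIO(str(table[0])))[0]
df.head()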

| | Airline | Image | IATA | ICAO | Callsign | Primary hubs, secondary hubs | Founded | Notes |
|---|---|---|---|---|---|---|---|---|
| 0 | Alaska Airlines | | AS | ASA | ALASKA | Seattle/Tacoma Anchorage Portland (OR) San Fra... | 1932 | Founded as McGee Airways and commenced operati... |
| 1 | Allegiant Air | | G4 | AAY | ALLEGIANT | Las Vegas Cincinnati Fort Walton Beach Indiana... | 1997 | Founded as WestJet Express and began operation... |
| 2 | American Airlines | | AA | AAL | AMERICAN | Dallas/Fort Worth Charlotte Chicago-O'Hare Los... | 1926 | Founded as American Airways and commenced oper... |
| 3 | Avelo Airlines | | XP | VXP | AVELO | Burbank New Haven Orlando Raleigh/Durham Wilmi... | 1987 | First did business as Casino Express Airlines ... |
| 4 | Breeze Airways | | MX | MXY | MOXY | Charleston Hartford New Orleans Norfolk Provid... | 2018 | Founded as Moxy Airways but was renamed due to... |
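
The scraping steps above can also be tried offline. The following is a minimal sketch, assuming a small inline HTML string in place of the live Wikipedia page; the `html` snippet, the `airlines` name, and the use of the built-in `html.parser` (instead of `lxml`) are illustrative choices, not part of the original notebook. The three rows are trimmed from the output shown above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, trimmed-down stand-in for the Wikipedia table (assumption for illustration)
html = """
<table class="wikitable sortable">
  <tr><th>Airline</th><th>IATA</th><th>Founded</th></tr>
  <tr><td>Alaska Airlines</td><td>AS</td><td>1932</td></tr>
  <tr><td>Allegiant Air</td><td>G4</td><td>1997</td></tr>
</table>
"""

# html.parser ships with Python, so this sketch does not need lxml installed
soup = BeautifulSoup(html, "html.parser")

# class_ matches any element whose class list contains "wikitable"
table = soup.find("table", class_="wikitable")

# Collect the text of every header/data cell, one list per table row
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# First row is the header, the rest are data rows
header, *body = rows
airlines = pd.DataFrame(body, columns=header)
print(airlines)
```

Walking the rows by hand is slower to write than `pd.read_html`, but it makes it easy to skip or clean individual cells (for example the image column) before building the DataFrame.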


# EDA

- Size and scale of the data:
  - `shape` (an attribute, not a method)
  - `info()`
  - `describe()`
  - `len()`
  - `columns`
- Data types
- Memory consumption and optimization (e.g. downcast `int64` to `int8` when the values fit)
- Determine whether the data has a target column (supervised learning) or not (unsupervised learning)
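
The size, dtype, and memory checks listed above might look like this in practice. This is a sketch on a made-up DataFrame; the `toy` frame and its `age` and `city` columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the notes)
toy = pd.DataFrame({
    "age": np.array([23, 35, 41, 58, 62], dtype="int64"),
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

print(toy.shape)             # (rows, columns); shape is an attribute
print(len(toy))              # number of rows
print(toy.columns.tolist())  # column names
toy.info()                   # dtypes, non-null counts, memory footprint
print(toy.describe())        # summary statistics for numeric columns

# Memory optimization: downcast int64 to int8 only if every value fits in the int8 range
before = toy["age"].memory_usage(index=False, deep=True)
int8_range = np.iinfo(np.int8)
if toy["age"].between(int8_range.min, int8_range.max).all():
    toy["age"] = toy["age"].astype("int8")
after = toy["age"].memory_usage(index=False, deep=True)
print(f"age column: {before} bytes -> {after} bytes")
```

The range check matters: casting a value outside -128..127 to `int8` silently wraps around instead of raising an error.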

- Issues in the data:
  - Missing values (imputation)
  - Target balance/distribution (e.g. the ratio between classes 0 and 1). Decide whether the minority class needs oversampling when the data is imbalanced
  - Categorical data
  - Scaling
    - Standardization: rescale to a mean of 0 and a standard deviation of 1 with `StandardScaler()`
    - Normalization: rescale to the range 0 to 1 with `MinMaxScaler()` (most used, compatible)
  - Encoding
    - Standard Encoding (One-Hot Encoding)
    - Ordinal Encoding (for features whose categories have a natural order)
  - Feature Extraction
  - Unnecessary columns
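
The target-balance, scaling, and encoding items above could be sketched as follows. The `toy` DataFrame, its column names, and the category order `small < medium < large` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical toy data (not from the notes)
toy = pd.DataFrame({
    "income": [30_000.0, 45_000.0, 60_000.0, 90_000.0],
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "size": ["small", "large", "medium", "small"],
    "target": [0, 0, 0, 1],
})

# Target balance: share of each class in the target column
balance = toy["target"].value_counts(normalize=True)
print(balance)

# Standardization: mean 0, standard deviation 1
standardized = StandardScaler().fit_transform(toy[["income"]])

# Normalization: values rescaled into the range [0, 1]
normalized = MinMaxScaler().fit_transform(toy[["income"]])

# One-hot encoding: one binary column per category (no order implied)
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[["city"]]).toarray()

# Ordinal encoding: integers that follow an explicit category order
size_order = [["small", "medium", "large"]]
ordinal = OrdinalEncoder(categories=size_order).fit_transform(toy[["size"]])

print(standardized.ravel(), normalized.ravel(), one_hot, ordinal.ravel(), sep="\n")
```

In a real pipeline the scalers and encoders would be fit on the training split only and then applied to the test split, to avoid leaking test statistics into training.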

## Imputation

In [None]:
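# Imputation
from sklearn.impute import KNNImputer, SimpleImputer

# Categorical: fill with the mode (most frequent value)
SimpleImputer(strategy='most_frequent')

# Numerical: use the mean for roughly normal data, the median for skewed data
SimpleImputer(strategy='mean')
SimpleImputer(strategy='median')
# pandas equivalent for a single column (assumes df has a numeric 'age' column)
df['age'] = df['age'].fillna(df['age'].mean())

# Advanced methods
KNNImputer()  # fills each gap from the k nearest neighbors (nearest-neighbor search, not clustering)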
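
To see the imputers above actually fill values, here is a small runnable sketch. The `toy` DataFrame and its columns are invented for illustration, and `n_neighbors=2` is an arbitrary choice for such a small sample.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with gaps (not from the notes)
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "income": [30.0, 35.0, np.nan, 80.0],
    "city": ["Austin", np.nan, "Austin", "Boston"],
})
num_cols = ["age", "income"]

# Categorical column: replace missing values with the most frequent category
city_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["city"]]).ravel()

# Numerical columns: mean for roughly normal data, median for skewed data
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
median_filled = SimpleImputer(strategy="median").fit_transform(toy[num_cols])

# pandas-only alternative for a single column
age_filled = toy["age"].fillna(toy["age"].mean())

# KNN imputation: each gap is filled from the rows most similar on the observed features
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy[num_cols])

print(city_filled)
print(mean_filled)
print(median_filled)
print(knn_filled)
```

Each imputer returns a NumPy array, so the result has to be assigned back to the DataFrame columns if the rest of the workflow uses pandas.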

## Outlier Treatment

- IQR (Interquartile Range)
- Z-scores (flag values beyond a cutoff measured in standard deviations, e.g. 2 or 2.5)
- Percentiles (below the 5th or above the 95th percentile)
- Use anomaly detection to trim outliers
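
The first three rules above can be expressed as boolean masks. This is a sketch on an invented series; the `values` data and the conventional 1.5 × IQR fence are assumptions, while the 2.5 z-score cutoff and the 5th/95th percentiles come from the list above.

```python
import pandas as pd

# Hypothetical sample with one obvious outlier (95)
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 95], name="value")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 2.5 standard deviations from the mean
z_scores = (values - values.mean()) / values.std(ddof=0)
z_mask = z_scores.abs() > 2.5

# Percentile rule: flag points below the 5th or above the 95th percentile
low, high = values.quantile([0.05, 0.95])
pct_mask = (values < low) | (values > high)

print("IQR outliers:       ", values[iqr_mask].tolist())
print("Z-score outliers:   ", values[z_mask].tolist())
print("Percentile outliers:", values[pct_mask].tolist())

# Trimming: keep only the rows that the chosen rule does not flag
trimmed = values[~iqr_mask]
```

On small samples the z-score rule is the weakest of the three, because a single extreme value inflates the standard deviation it is measured against.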

## Final Data Prep

- Split the data into X (features) and y (target) for supervised learning
- Split X and y into training and testing sets
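
Both splits might be done as shown below. This is a sketch assuming a made-up `toy` DataFrame with a binary `target` column; the 80/20 ratio, the fixed `random_state`, and stratification are illustrative choices rather than requirements from the notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not from the notes)
toy = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 38, 44, 56, 31],
    "income": [28, 52, 61, 75, 40, 82, 58, 66, 79, 45],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Step 1: separate the features (X) from the target (y)
X = toy.drop(columns="target")
y = toy["target"]

# Step 2: hold out 20% of the rows for testing.
# stratify=y keeps the class ratio the same in both splits;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```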

# Project 2 Clusters Evaluation

## Methods
1. EDA and visual evaluation: check which clustering algorithm builds clusters that represent the data properly
2. Silhouette Score (`silhouette_score`)
   - Values range from -1 to 1
   - Closer to -1: poorly clustered (samples are likely assigned to the wrong cluster)
   - Around 0: samples sit on or near the border between clusters, i.e. the clusters overlap
   - Closer to 1: well-clustered data
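
A silhouette score could be computed as in the sketch below. It assumes synthetic data from `make_blobs` and K-Means as the clustering algorithm; both are stand-ins chosen for illustration, since the notes do not say which data or algorithm the project uses.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated 2D data (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Cluster the points, then score how well each sample fits its assigned cluster
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette score for k=3: {score:.3f}")

# Comparing several cluster counts is a common way to use the score
for k in range(2, 6):
    k_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, k_labels), 3))
```

The score needs at least two clusters and fewer clusters than samples, so `k=1` cannot be evaluated this way.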