# Stoneburner, Kurt
- ## DSC 540 - Week 05/06
- ## Chapter 5, Activity 7

In this activity, you are given the Wikipedia page that lists the GDP of all countries, and you are asked to create three data frames from the three sources mentioned on the page (link: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)).

You will have to -

- Open the page in a separate Chrome/Firefox tab and use a tool such as `inspect element` to view the source HTML and understand its structure
- Read the page using bs4
- Find the table structure you will need to deal with (how many tables are there)
- Find the right table using bs4
- Separate the Source Names and their corresponding data
- Get the source names from the list of sources you have created
- Separate the header and data from the data that you separated before, for the first source only, and then create a DataFrame from it
- Repeat the last task for the other two data sources.

In [1]:
# //****************************************************************************************
# //*** Import the libraries used in this notebook.
# //*** os and sys are imported for the pseudo-relative path setup used on Stoneburner projects.
# //****************************************************************************************

import os
import sys
# //*** Imports and Load Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
#//*** Going to use the requests library, since it's the same library used for API calls
import requests

Resource:
- https://generalistprogrammer.com/python/python-web-scraping-tutorial-with-beautifulsoup-and-requests/


In [2]:
#//*** Use Requests to get the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
response = requests.get(url)

In [3]:
#//*** Verify the response is OK. response.ok is True for any status code below 400, not only 200.
if not response.ok:
    #//*** Stop here: without a page, soup would be undefined in the lines below.
    raise RuntimeError(f"Problem with the URL request: HTTP {response.status_code}")

#//*** Make soup...Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

#//*** The tables all have the class name wikitable
#//*** Discard the first table, which is the container for the other three.
tables = soup.find_all('table', class_='wikitable')[1:]

In [10]:
#//*** Use the tableCounter to keep track of which table we are working in.
#//*** If we were super cool, we'd tie something in to the first row of the main table to keep track
#//*** of which table/dataframe is which. But since we are looking for very specific things, it doesn't make
#//*** sense to invest in a more robust structure, since any slight change of source will break the whole scrape.

#//*** Personally, I'd skip Beautiful Soup and just use regex, mostly because regex is universal and can be applied
#//*** to other scenarios as well as other languages. It is very useful across many tasks. I've also written
#//*** an HTML parser in JavaScript using regex. It was for a project where news talent reads scripts off of an iPad
#//*** instead of paper scripts.

#//*** The book uses a very pythonic single-line method to generate a multidimensional array. I'm not a fan of
#//*** that in general. I prefer more readable, verbose code.

#//*** We'll deviate from the assignment a bit by handling all three tables in a loop.

#//*** tableCounter helps keep track of which of the three tables we are working in. This is required
#//*** when determining which dataframe to build
tableCounter = 0

#//*** Parse Each table
for table in tables:
    tableCounter+=1
    
    #//************************************
    #//*** Build Table Headers
    #//************************************
    #//*** Get the Table Headers. These will be our data frame Columns.
    ths = table.find_all("th")
    #//*** initialize a list to hold the column names
    colnames = []
    
    #//*** Column names are the first value contained in each th's contents
    for th in ths:
        colnames.append(th.contents[0])
    
    #//**********************************
    #//*** Initialize tableDict.
    #//**********************************
    #//*** tableDict is a dictionary container to hold row data.
    #//*** The tableDict will hold each of the row lists. The keys will be each colname
    tableDict = {}
    
    #//*** Initialize tableDict
    for name in colnames:
        tableDict[name] = []
    
    #//***********************************************
    #//*** Process each tablerow
    #//*** The sausage is primarily made here
    #//***********************************************
    
    #//*** Get a BS list of table rows
    trs = table.find_all("tr")
    
    #//*** For each table row in tablerows
    for tr in trs:
        #//*** Skip the table header
        if len(tr.find_all("th")) == 0:
            #//*** Loop through the colnames index.
            #//*** This gets the key under which the TD data is stored.
            #//*** Get the TD with the corresponding index value and extract the text.
            for x in range(0,len(colnames)):
                #//*** Append the text to the appropriate colname list.
                #//*** Using index values keeps everything aligned.
                tableDict[colnames[x]].append(tr.find_all('td')[x].text.replace("\n",""))
    
    #//**************************************************************************************
    #//*** Table is fully parsed into the tableDict
    #//*** Remove the first element of each list. It contains the World Summary numbers
    #//**************************************************************************************
    for key in tableDict.keys():
        tableDict[key].pop(0)
        
    #//*********************************************************
    #//*** Convert tableDict into a df.
    #//*** the individual df is determined by the tableCounter
    #//*********************************************************
    if tableCounter == 1:
        #//*** Create the IMF Dataframe
        imf_df = pd.DataFrame()
        
        #//*** Add each Column to dataframe
        for x in colnames:
            imf_df[x] = tableDict[x]
    
    elif tableCounter == 2:
        #//*** Create the World Bank Dataframe
        worldbank_df = pd.DataFrame()
        
        #//*** Add each Column to dataframe
        for x in colnames:
            worldbank_df[x] = tableDict[x]
        
    elif tableCounter == 3:
        #//*** Create the UN Dataframe
        un_df = pd.DataFrame()
        
        #//*** Add each Column to dataframe
        for x in colnames:
            un_df[x] = tableDict[x]
    
#//*********************************************************
#//*** END table in tables
#//*********************************************************
        


In [11]:
print("\n#############")
print("IMF")
print("#############")
print(imf_df.head())

print("\n#############")
print("World Bank")
print("#############")
print(worldbank_df.head())

print("\n#############")
print("UN")
print("#############")
print(un_df.head())



#############
IMF
#############
  Rank Country/Territory         GDP
0    1     United States  20,807,269
1    2   China[n 2][n 3]  14,860,775
2    3             Japan   4,910,580
3    4           Germany   3,780,553
4    5    United Kingdom   2,638,296

#############
World Bank
#############
  Rank Country/Territory         GDP
0    1     United States  21,427,700
1    2        China[n 9]  14,342,903
2    3             Japan   5,081,770
3    4           Germany   3,845,630
4    5             India   2,875,142

#############
UN
#############
  Rank Country/Territory         GDP
0    1     United States  21,433,226
1    2        China[n 9]  14,342,933
2    3             Japan   5,082,465
3    4           Germany   3,861,123
4    5             India   2,891,582
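
The output above shows that the scraped values are still text: country names carry footnote markers such as `[n 2]`, and the GDP figures contain thousands separators. A minimal sketch of one possible cleanup follows. It is not part of the assignment; `sample_df` is a stand-in frame built from the first three IMF rows printed above, not the scraped `imf_df` itself.

```python
import pandas as pd

# Stand-in frame shaped like the scraped output (first three IMF rows printed above).
sample_df = pd.DataFrame({
    "Rank": ["1", "2", "3"],
    "Country/Territory": ["United States", "China[n 2][n 3]", "Japan"],
    "GDP": ["20,807,269", "14,860,775", "4,910,580"],
})

# Drop bracketed footnote markers such as "[n 2]" from the country names.
sample_df["Country/Territory"] = (
    sample_df["Country/Territory"]
    .str.replace(r"\[[^\]]*\]", "", regex=True)
    .str.strip()
)

# Remove the thousands separators, then convert the text columns to integers.
sample_df["GDP"] = sample_df["GDP"].str.replace(",", "", regex=False).astype("int64")
sample_df["Rank"] = sample_df["Rank"].astype("int64")

print(sample_df)
```

The same three steps could be applied to each of the three data frames if numeric comparisons between the sources are needed.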


# Chapter 6, Activity 08

In this activity we do the following

* Create a data frame from a given CSV
* Check for duplicates in the columns that matter
* Check for NaN in the columns that matter
* Apply our domain knowledge to single out and remove outliers
* Generate nice print statements as reports for the different steps

The data set has 1000 rows and represents the traffic on a certain page of a website. The names, emails, and IP addresses are fake in order to preserve privacy.

### Load the data (the file name is - visit_data.csv)

In [None]:
### Write your code below this comment

### Task - 1 (Are there duplicates?)

In [None]:
### Write your code below this comment
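
One way to approach the duplicate check is `Series.duplicated()`, sketched here on a tiny made-up frame rather than on `visit_data.csv`. The column names `email` and `ip_address` are assumptions for illustration; the actual file may name them differently.

```python
import pandas as pd

# Made-up stand-in for the real data; column names are hypothetical.
visits = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com", "c@example.com"],
    "ip_address": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "visit": [120, 340, 560, 780],
})

# duplicated() flags every repeat of an earlier value; any() collapses that to one answer per column.
dup_report = {
    col: bool(visits[col].duplicated().any())
    for col in ["email", "ip_address"]
}

for col, has_dups in dup_report.items():
    print(f"Duplicates in {col}: {has_dups}")
```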

### Task - 2 (Do any essential columns contain NaN?)

In [None]:
### Write your code below this comment
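
A possible shape for the NaN check is `isnull().sum()` over the columns that matter. The sketch below runs on a small invented frame; the list of essential columns (`email`, `ip_address`, `visit`) is an assumption, not something the assignment states.

```python
import numpy as np
import pandas as pd

# Invented stand-in data with a few missing values; column names are hypothetical.
visits = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com"],
    "ip_address": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "visit": [120.0, np.nan, 560.0],
})

# Assumed set of essential columns.
essential = ["email", "ip_address", "visit"]

# isnull() marks missing cells; sum() counts them per column.
nan_counts = visits[essential].isnull().sum()

for col in essential:
    print(f"{col}: {nan_counts[col]} missing value(s)")
```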

### Task - 3 (Get rid of the outliers)

Consider which columns are essential if you are preparing this dataset for a model-building exercise where the target is to predict the number of visits given a user name, email, IP address, gender, etc.

In [None]:
### Write your code below this comment

### Task - 4 (Report the size difference)

The `shape` attribute of a data frame gives you a tuple that represents the (row, column) counts of the data frame. In this task, you have to compare and report the number of rows before and after getting rid of the outliers.

In [None]:
### Write your code below this comment
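
Tasks 3 and 4 can be sketched together: filter on a domain-knowledge range, then compare `shape[0]` before and after. Everything specific here is an assumption for illustration: the `visit` column name, the invented values, and the bounds `LOW` and `HIGH`, which would need to come from knowledge of the real traffic.

```python
import numpy as np
import pandas as pd

# Invented stand-in data; the column name "visit" is hypothetical.
visits = pd.DataFrame({"visit": [50.0, 150.0, 900.0, np.nan, 2950.0, 1200.0]})

rows_before = visits.shape[0]

# Hypothetical plausible range for visit counts; pick real bounds from domain knowledge.
LOW, HIGH = 100, 2900

# Drop rows with a missing target first, then keep only values inside the range (inclusive).
cleaned = visits.dropna(subset=["visit"])
cleaned = cleaned[cleaned["visit"].between(LOW, HIGH)]

rows_after = cleaned.shape[0]

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning:  {rows_after}")
print(f"Rows removed:         {rows_before - rows_after}")
```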

### Task - 5 (Box plot visit to check further for any outliers)

In [None]:
### Write your code below this comment

Insert data into a SQLite database: create a table with the following data (Hint: Python for Data Analysis, page 191):

a. Name, Address, City, State, Zip, Phone Number

b. Add at least 10 rows of data and submit your code with a query generating your results.

In [None]:
# //*** CODE HERE
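
A minimal sketch of the SQLite task using Python's built-in `sqlite3` module follows. The table name `contacts`, the in-memory database, and all ten rows are placeholders invented for illustration; real names and addresses would replace them.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind; pass a file path to persist it.
conn = sqlite3.connect(":memory:")

# Hypothetical table name; columns follow the list in the assignment.
conn.execute(
    """
    CREATE TABLE contacts (
        name TEXT, address TEXT, city TEXT,
        state TEXT, zip TEXT, phone TEXT
    )
    """
)

# Ten obviously fake placeholder rows.
rows = [
    (f"Person {i}", f"{i} Example St", "Sampleville", "NE", f"{i:05d}", f"555-01{i:02d}")
    for i in range(1, 11)
]

# Parameterized inserts keep the values separate from the SQL text.
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Query the table back in insertion order and print the results.
results = conn.execute(
    "SELECT name, address, city, state, zip, phone FROM contacts ORDER BY rowid"
).fetchall()

for row in results:
    print(row)

conn.close()
```

ZIP codes and phone numbers are stored as `TEXT` so that leading zeros and punctuation survive.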