# 06 Getting Data into Python

> "You don’t understand anything until you learn it more than one way". ~ Marvin Minsky

> "Wear your learning, like a watch, in a private pocket: and do not pull it out and strike it, merely to show that you have one." ~ Lord Chesterfield

![sankey](http://visualoop.com/media/2015/05/1.jpg)

**Source:** [Valerio Pellegrini](https://www.behance.net/valeriopellegrini)

## Notebook Outline

1. Getting Data Into Python
2. Text
3. Excel
4. HTML

# 1. Getting Data into Python

![all the data](http://blogs.ubc.ca/coetoolbox/files/2014/03/meme-data-data-everywhere.png)

When working with data in Python you will encounter datasets in all shapes and formats, so it is crucial to understand how to deal with them. The list below shows five common formats and the pandas function that reads each one. In this notebook we will work with text (CSV and TSV), Excel and HTML files:

- CSV --> Comma Separated Values --> `pd.read_csv(file, sep=',')`
- TSV --> Tab Separated Values --> `pd.read_csv(file, sep='\t')`
- Excel --> Microsoft Excel format (.xlsx) --> `pd.read_excel()`
- JSON --> JavaScript Object Notation --> `pd.read_json()`
- HTML --> Hypertext Markup Language --> `pd.read_html()`

For this part of the lesson, we will be using some real world datasets. You can find more info about them (e.g. how to download them) in the datasets directory one level above this one. Please download them and add them to the datasets directory for this course, or whichever directory you'd like to use.

![want the data](http://blog.charitydynamics.com/wp-content/uploads/2015/05/data-cat_2.png)

# 2. Text Files

Text files are extremely common among organisations, and hence, they will form a big part of the files you will encounter in your daily work. More specifically, files with a format such as **Comma** and **Tab** separated values, along with other text files that use different delimiters, might amount to half (if not more) of the files that you will see at work.

These two formats, **Comma** and **Tab**, are still plain text files, but they are very useful for saving and distributing small and large datasets in a tabular way. You can identify both kinds of files by looking at the suffix in the file name: comma separated values end in `.csv` while tab separated values end in `.tsv`.

What makes these two kinds of files so similar is that their values are separated by something called a delimiter. If you have a CSV or TSV file, try opening it in a plain text editor (for example, Notepad on Windows) and notice what comes up.

![csv](pictures/csv_file.png)

Notice that in the example above every value is separated by a comma. The column headers can be found at the very top of the file (this is common practice), although sometimes you might not have them available at all. When we save files as TSV, CSV or with any other kind of delimiter, values that contain the delimiter itself (for example, a comma inside a CSV field) are usually wrapped in quotation marks so that they are not mistaken for a separator.
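To make the quoting behaviour concrete, here is a minimal sketch that writes a tiny, made-up dataframe to CSV text and reads it back. The names `df_q`, `csv_text` and `round_trip` are only for illustration.

```python
import io
import pandas as pd

# A made-up frame where one value contains a comma
df_q = pd.DataFrame({"name": ["Yemen, Rep.", "Zambia"], "code": ["YEM", "ZMB"]})

# to_csv() returns the CSV as a string when no file path is given
csv_text = df_q.to_csv(index=False)
print(csv_text)  # the value with a comma is wrapped in double quotes

# Reading the text back keeps "Yemen, Rep." as a single value
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

The quotation marks are what let pandas tell the comma inside the country name apart from the commas that separate columns.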

Lastly, let's talk about how pandas handles these types of files. To read text files into a DataFrame with pandas we can use the function `pd.read_csv()` or `pd.read_table()`. These functions, at the time of writing, have over 50 parameters that allow us to customise how the data is read in. One of the most important parameters is `sep=`, which allows us to define the delimiter the data is split by. `","` is the default for `pd.read_csv()` and `"\t"` is the default for `pd.read_table()`.

The following parameters are some of the most useful ones not only for reading in text files, but also to tackle, and save time with, many of the operations you might need to perform after reading the data. Please visit the pandas documentation for more info.

- `header=` --> tells pandas which row (if any) contains the column headers of the dataframe.
- `names=[list, of, column, names]` --> allows us to explicitly name the columns of a dataframe in the order in which they are read.
- `parse_dates=` --> tells pandas which columns to try to parse as dates so that they get a datetime data type.
- `index_col=` --> allows us to assign a specific column as the index of our dataframe.
- `skiprows=[1, 2, 3, 4]` --> tells pandas which rows we want to skip.
- `na_values=` --> takes in a list of values that might be missing and assigns them the NaN value, which stands for not a number.
- `encoding=` --> data coming in from a variety of sources could have different encodings, e.g. 'UTF-8' or 'ASCII', and this parameter helps us specify which one we need for our data.
- `nrows=4` --> how many rows do you want to read from a file. Very useful tool for examining the first few lines of large files.
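Several of these parameters can be combined in one call. The sketch below uses a small, made-up text "file" held in memory with `io.StringIO` so that it runs without any download. The contents, the semicolon delimiter and the column names are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Made-up file contents: two junk lines, no header row, "n/a" marks a missing value
raw = io.StringIO(
    "exported by some tool\n"
    "2020-11-11\n"
    "Florey;11.95;5.55\n"
    "Monash;12.99;n/a\n"
    "Civic;14.35;5.76\n"
)

df_demo = pd.read_csv(
    raw,
    sep=";",                         # values are separated by semicolons
    skiprows=[0, 1],                 # skip the two junk lines at the top
    header=None,                     # the data itself has no header row
    names=["site", "pm10", "pm25"],  # so we supply the column names ourselves
    na_values=["n/a"],               # treat "n/a" as missing (NaN)
    nrows=2,                         # read only the first two data rows
)
print(df_demo)
```

Only the Florey and Monash rows are read, and the `"n/a"` entry shows up as `NaN`.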

Let's first read in a CSV file and then a TSV one, the Air Quality Monitoring dataset.

In [1]:
import pandas as pd

The first argument is the name of the file as a string or the path to the folder where the data lives followed by the name of the file and its extension. Once you load the dataset and assign it to a variable, you can see its first 5 rows plus the column names using the method `.head()`, or the method `.tail()` for the last 5.

In [3]:
# The first argument is the folder where the data lives and the name of the data

df_csv = pd.read_csv('C:/Users/monch.mercader/Python/Data_Analytics/Module 1/datasets/files/seek_australia.csv')

In [4]:
df_csv.head()

Unnamed: 0,category,city,company_name,geo,job_board,job_description,job_title,job_type,post_date,salary_offered,state,url
0,Retail & Consumer Products,Sydney,Frontline Executive Retail Sydney,AU,seek,Have you had 10 years experience in fresh pro...,Store Manager - Fresh Produce,Full Time,2018-04-15T23:13:45Z,$100k Base + Super + Benefits,North Shore & Northern Beaches,https://www.seek.com.au/job/35989382
1,Government & Defence,Brisbane,Powerlink,AU,seek,The Opportunity: The Client Solution Analyst ...,Client Solution Analyst,Full Time,2018-04-15T23:04:40Z,Excellent remuneration packages,Northern Suburbs,https://www.seek.com.au/job/35989272
2,Trades & Services,Sydney,Richard Jay Laundry,AU,seek,An innovative business development role for a...,Service Technician / Installer - NSW,Full Time,2018-04-15T23:04:31Z,,Parramatta & Western Suburbs,https://www.seek.com.au/job/35989270
3,Trades & Services,Melbourne,Adaptalift Hyster,AU,seek,About the role: We are seeking an Automotive W...,Workshop Technician I Material Handling Equipment,Full Time,2018-04-16T03:15:17Z,,Bayside & South Eastern Suburbs,https://www.seek.com.au/job/35993203
4,Trades & Services,Adelaide,Bakers Delight G&M,AU,seek,Â Early starts and weekend shifts. No experie...,APPRENTICESHIP JUNIOR BAKER,Full Time,2018-04-16T01:26:50Z,,,https://www.seek.com.au/job/35991578


Notice how the look and feel of our file resembles that of a spreadsheet in Excel or Google Sheets.

To read in Tab Separated Value files, all we need to do is to pass in the `sep=` argument to our `pd.read_csv()` function and provide pandas with the specific delimiter the data is split by. For tab separated values we use `\t`, but there are many other delimiters one can choose from.

In [7]:
# air quality data in Australia
df_tsv = pd.read_csv('C:/Users/monch.mercader/Python/Data_Analytics/Module 1/datasets/files/Air_Quality_Monitoring_Data.tsv', sep='\t')
df_tsv.head()

Unnamed: 0,Name,GPS,DateTime,NO2,O3_1hr,O3_4hr,CO,PM10,PM2.5,AQI_CO,AQI_NO2,AQI_O3_1hr,AQI_O3_4hr,AQI_PM10,AQI_PM2.5,AQI_Site,Date,Time
0,Florey,"(-35.220606, 149.043539)",11/11/2020 04:00:00 PM,,,,,11.95,5.55,,,,,23.0,22.0,,11 November 2020,16:00:00
1,Monash,"(-35.418302, 149.094018)",11/11/2020 04:00:00 PM,,,,,12.99,5.42,,,,,25.0,21.0,,11 November 2020,16:00:00
2,Civic,"(-35.285307, 149.131579)",11/11/2020 04:00:00 PM,,,,,14.35,5.76,,,,,28.0,23.0,28.0,11 November 2020,16:00:00
3,Florey,"(-35.220606, 149.043539)",11/11/2020 05:00:00 PM,0.0,0.035,0.037,0.0,12.5,5.55,0.0,0.0,35.0,47.0,25.0,22.0,47.0,11 November 2020,17:00:00
4,Monash,"(-35.418302, 149.094018)",11/11/2020 05:00:00 PM,0.0,0.038,0.037,0.0,13.41,5.4,0.0,0.0,38.0,46.0,26.0,21.0,46.0,11 November 2020,17:00:00


There is another function in pandas that uses the tab character `"\t"` as its default delimiter, and that is `pd.read_table()`. You should use whichever you prefer, especially since most of the options in one can be found in the other. This means that by passing `sep=','` to `pd.read_table()`, you can obtain the same result as with `pd.read_csv()` and read in Comma Separated Values.

In [9]:
# occupational licences data, read without sep=',' so each line ends up in a single column
df_table = pd.read_table('C:/Users/monch.mercader/Python/Data_Analytics/Module 1/datasets/files/occupational_licences.csv')
df_table.head()

Unnamed: 0,"LicenceNumber,LicenceCode,LicenceType,LicenseeOrg,liccount,lictotal,Licensees,TradingNames,Address,Suburb,State,Postcode,Dateexpiry,Dateissued"
0,"17721736,SEMP,""1A - patrol, guard, watch or pr..."
1,"17723761,SEMP,1C - act as a crowd controller,I..."
2,"18901138,EA,Licensed Employment Agent,Company,..."
3,"17723543,SEMP,""1A - patrol, guard, watch or pr..."
4,"18402498,RA,Licensed Real Estate Agents,Indivi..."
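The effect of the delimiter on `pd.read_table()` can be checked on a small in-memory example. This is a sketch with made-up values; only the first two column names are borrowed from the licences file.

```python
import io
import pandas as pd

# Made-up stand-in for a small comma separated file
text = "LicenceNumber,LicenceCode,State\n17721736,SEMP,ACT\n18901138,EA,ACT\n"

no_sep = pd.read_table(io.StringIO(text))              # default sep="\t": one wide column
with_sep = pd.read_table(io.StringIO(text), sep=",")   # explicit comma: three columns
via_csv = pd.read_csv(io.StringIO(text))               # read_csv defaults to sep=","

print(no_sep.shape)              # (2, 1)
print(with_sep.shape)            # (2, 3)
print(with_sep.equals(via_csv))  # True
```

With the default tab delimiter pandas finds nothing to split on, so every line lands in a single column, just like in the output above.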


If we know a dataset has dates in it we can also convert it to the datetime format provided by pandas as we read in the data. We can do this by putting the names of the columns that have dates in the `parse_dates=` parameter of our `pd.read_csv()` function.

In [10]:
# The first argument is the folder where the data lives and the name of the data

df_csv = pd.read_csv('C:/Users/monch.mercader/Python/Data_Analytics/Module 1/datasets/files/Crashes_Last_Five_Years.csv', parse_dates=['ACCIDENT_DATE', 'ACCIDENT_TIME'])
df_csv.head()

Unnamed: 0,OBJECTID,ACCIDENT_NO,ABS_CODE,ACCIDENT_STATUS,ACCIDENT_DATE,ACCIDENT_TIME,ALCOHOLTIME,ACCIDENT_TYPE,DAY_OF_WEEK,DCA_CODE,...,DEG_URBAN_ALL,LGA_NAME_ALL,REGION_NAME_ALL,SRNS,SRNS_ALL,RMA,RMA_ALL,DIVIDED,DIVIDED_ALL,STAT_DIV_NAME
0,3401744,T20130013732,ABS to receive accident,Finished,2013-01-07,18.30.00,Yes,Struck Pedestrian,Monday,PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIGHT.,...,MELB_URBAN,MELBOURNE,METROPOLITAN NORTH WEST REGION,,,Local Road,Local Road,Undivided,Undiv,Metro
1,3401745,T20130013736,ABS to receive accident,Finished,2013-02-07,16.40.00,No,Collision with vehicle,Tuesday,PARKED VEHICLES ONLY,...,MELB_URBAN,WHITEHORSE,METROPOLITAN SOUTH EAST REGION,,,Arterial Other,"Arterial Other,Local Road",Divided,"Div,Undiv",Metro
2,3401746,T20130013737,ABS to receive accident,Finished,2013-02-07,13.15.00,No,Collision with a fixed object,Tuesday,RIGHT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE,...,MELB_URBAN,BRIMBANK,METROPOLITAN NORTH WEST REGION,,,Local Road,Local Road,Undivided,Undiv,Metro
3,3401747,T20130013738,ABS to receive accident,Finished,2013-02-07,16.45.00,No,Collision with a fixed object,Tuesday,RIGHT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE,...,RURAL_VICTORIA,MITCHELL,NORTHERN REGION,M,M,Freeway,Freeway,Divided,Div,Country
4,3401748,T20130013739,ABS to receive accident,Finished,2013-02-07,15.48.00,No,Collision with vehicle,Tuesday,U TURN,...,"MELBOURNE_CBD,MELB_URBAN",MELBOURNE,METROPOLITAN NORTH WEST REGION,,,Local Road,Local Road,Undivided,Undiv,Metro


Now we can extract additional information from our date variables with some attributes that you can find in the [dates section of pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html).

In [11]:
# here we are accessing the year
df_csv['ACCIDENT_DATE'].dt.year.head()

0    2013
1    2013
2    2013
3    2013
4    2013
Name: ACCIDENT_DATE, dtype: int64

In [12]:
# here we are accessing the month
df_csv['ACCIDENT_DATE'].dt.month.tail()

74903    1
74904    1
74905    7
74906    1
74907    1
Name: ACCIDENT_DATE, dtype: int64
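The `.dt` accessor offers more than the year and the month. Here is a small sketch on a made-up Series of dates; the name `parts` is only illustrative.

```python
import pandas as pd

# Two made-up dates stored with the datetime data type
dates = pd.Series(pd.to_datetime(["2013-01-07", "2014-12-25"]), name="ACCIDENT_DATE")

# Collect a few date components side by side
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.day_name(),  # day_name() is a method, so it needs parentheses
})
print(parts)
```

These attributes are only available when the column has a datetime data type, which is why parsing the dates while reading the file pays off.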

## Exercise 1

Go to any of the websites below, download a dataset of your choosing and read it into memory with `pd.read_csv()`. Use at least one additional argument to read in your file.

- [Kaggle Datasets](https://www.kaggle.com/datasets)
- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Data.gov](https://www.data.gov/)
- [FiveThirtyEight](https://data.fivethirtyeight.com/)

In [17]:
pd.read_csv('../Module 1/datasets/files/WorldPopulation.csv')

Unnamed: 0,Country Name,Country Code,Indicator Name,1960,1961,1962,1963,1964,1965,1966,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Aruba,ABW,"Population, total",54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,...,101669.0,102046.0,102560.0,103159.0,103774.0,104341.0,104872.0,105366.0,105845.0,106314.0
1,Afghanistan,AFG,"Population, total",8996973.0,9169410.0,9351441.0,9543205.0,9744781.0,9956320.0,10174836.0,...,29185507.0,30117413.0,31161376.0,32269589.0,33370794.0,34413603.0,35383128.0,36296400.0,37172386.0,38041754.0
2,Angola,AGO,"Population, total",5454933.0,5531472.0,5608539.0,5679458.0,5735044.0,5770570.0,5781214.0,...,23356246.0,24220661.0,25107931.0,26015780.0,26941779.0,27884381.0,28842484.0,29816748.0,30809762.0,31825295.0
3,Albania,ALB,"Population, total",1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,...,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0,2866376.0,2854191.0
4,Andorra,AND,"Population, total",13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,...,84449.0,83747.0,82427.0,80774.0,79213.0,78011.0,77297.0,77001.0,77006.0,77142.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,Kosovo,XKX,"Population, total",947000.0,966000.0,994000.0,1022000.0,1050000.0,1078000.0,1106000.0,...,1775680.0,1791000.0,1807106.0,1818117.0,1812771.0,1788196.0,1777557.0,1791003.0,1797085.0,1794248.0
260,"Yemen, Rep.",YEM,"Population, total",5315355.0,5393036.0,5473671.0,5556766.0,5641597.0,5727751.0,5816247.0,...,23154855.0,23807588.0,24473178.0,25147109.0,25823485.0,26497889.0,27168210.0,27834821.0,28498687.0,29161922.0
261,South Africa,ZAF,"Population, total",17099840.0,17524533.0,17965725.0,18423161.0,18896307.0,19384841.0,19888250.0,...,51216964.0,52004172.0,52834005.0,53689236.0,54545991.0,55386367.0,56203654.0,57000451.0,57779622.0,58558270.0
262,Zambia,ZMB,"Population, total",3070776.0,3164329.0,3260650.0,3360104.0,3463213.0,3570464.0,3681955.0,...,13605984.0,14023193.0,14465121.0,14926504.0,15399753.0,15879361.0,16363507.0,16853688.0,17351822.0,17861030.0


In [18]:
df_excel = pd.read_excel("../Module 1/datasets/files/supermarket_demo.xlsx", sheet_name = 'egypt', parse_dates=True)
df_excel.tail()

Unnamed: 0,invoiceID,branch,city,cust_type,gender,type,unit_price,quantity,date,time,payment,cost,gross income,rating
995,894-41-5205,C,Alexandria,Normal,Female,Food and beverages,43.18,8,2019-01-19,19:39:00,Credit card,345.44,17.272,8.3
996,895-03-6665,B,Ismailia,Normal,Female,Fashion accessories,36.51,9,2019-02-16,10:52:00,Cash,328.59,16.4295,4.2
997,895-66-0685,B,Ismailia,Member,Male,Food and beverages,18.08,3,2019-03-05,19:46:00,eWallet,54.24,2.712,8.0
998,896-34-0956,A,Cairo,Normal,Male,Fashion accessories,21.32,1,2019-01-26,12:43:00,Cash,21.32,1.066,5.9
999,898-04-2717,A,Cairo,Normal,Male,Fashion accessories,76.4,9,2019-03-19,15:49:00,eWallet,687.6,34.38,7.5


In [19]:
#check the new type
type(df_excel['date'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [20]:
type(df_excel.loc[0, 'date'])

pandas._libs.tslibs.timestamps.Timestamp

In [21]:
type(df_excel.iloc[0,8])

pandas._libs.tslibs.timestamps.Timestamp

# 3. Excel Files

![excel file](https://s3-us-west-2.amazonaws.com/cdn.mychoicesoftware.com/blog/Excel_Meme.jpg)

Excel files are very common, especially if some or all of the members of your team use Excel for their analyses and tend to share these with you periodically. If this is the case for you, you probably have to read Excel files at work constantly, either with Excel or Google Sheets. But that is up until this point, of course. Fortunately, pandas provides a nice function to read in Excel files that is flexible enough to allow you to read in specific sheets one at a time if that is what your use case requires.

The pandas function `pd.read_excel()`, just like `pd.read_csv()`, provides a plethora of options that you can choose from to tackle the complexity with which many Excel files are created.

In [None]:
# read in a regular file
df_excel = pd.read_excel("../datasets/files/supermarket_demo.xlsx")
df_excel.tail()

If you open up the supermarket dataset you will notice that it contains 2 sheets. By default, the pandas function `pd.read_excel()` reads in the first sheet it finds in a spreadsheet, so in our case that is the myanmar dataset as shown above. Let's now read in the egypt one with the help of the `sheet_name=` argument.

Note that in the call below we also pass `parse_dates=True` instead of naming the columns we want to parse. According to the pandas documentation, `True` asks pandas to try parsing the index as dates; columns that Excel already stores as dates typically come back as timestamps either way. In most cases it is better to be explicit about which variables you would like to parse as dates as opposed to leaving it to pandas.

In [None]:
df_excel = pd.read_excel("../datasets/files/supermarket_demo.xlsx", sheet_name='egypt', parse_dates=True)
df_excel.tail()

In [None]:
# check the new type
type(df_excel['date'][0])
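Being explicit about dates can also happen after the data has been loaded. The sketch below assumes a small, made-up dataframe (`sales`) whose `date` column arrived as text, and converts it with `pd.to_datetime()` using a stated format.

```python
import pandas as pd

# Made-up frame standing in for a sheet whose dates arrived as plain text
sales = pd.DataFrame({"date": ["2019-01-19", "2019-02-16"], "cost": [345.44, 328.59]})

# Convert the column explicitly, stating the expected format
sales["date"] = pd.to_datetime(sales["date"], format="%Y-%m-%d")

# Check the new type of a single value
print(type(sales["date"][0]))
```

Stating the format makes the conversion fail loudly if a value does not match, instead of silently guessing.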

## Exercise 2

1. Find an Excel (or any spreadsheet) file in your computer that has data in a tabular format (i.e. a big square with rows and columns) and read it into your session with `pd.read_excel()`.

2. If you can't find one to read in, create one with fake data and use that one instead. It does not need to have a lot of data in it. 10 rows and 5 columns would work fine.

3. If you don't have Excel, you can create a spreadsheet using Google Sheets and download it as an Excel file.

If neither option above is feasible for you, please move on to the next section.

In [24]:
world_excel = pd.read_excel("../Module 1/datasets/files/WorldPopulation_excel.xlsx")
world_excel.tail()

Unnamed: 0,Country Name,Country Code,Indicator Name,1960,1961,1962,1963,1964,1965,1966,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
259,Kosovo,XKX,"Population, total",947000.0,966000.0,994000.0,1022000.0,1050000.0,1078000.0,1106000.0,...,1775680.0,1791000.0,1807106.0,1818117.0,1812771.0,1788196.0,1777557.0,1791003.0,1797085.0,1794248.0
260,"Yemen, Rep.",YEM,"Population, total",5315355.0,5393036.0,5473671.0,5556766.0,5641597.0,5727751.0,5816247.0,...,23154855.0,23807588.0,24473178.0,25147109.0,25823485.0,26497889.0,27168210.0,27834821.0,28498687.0,29161922.0
261,South Africa,ZAF,"Population, total",17099840.0,17524533.0,17965725.0,18423161.0,18896307.0,19384841.0,19888250.0,...,51216964.0,52004172.0,52834005.0,53689236.0,54545991.0,55386367.0,56203654.0,57000451.0,57779622.0,58558270.0
262,Zambia,ZMB,"Population, total",3070776.0,3164329.0,3260650.0,3360104.0,3463213.0,3570464.0,3681955.0,...,13605984.0,14023193.0,14465121.0,14926504.0,15399753.0,15879361.0,16363507.0,16853688.0,17351822.0,17861030.0
263,Zimbabwe,ZWE,"Population, total",3776681.0,3905034.0,4039201.0,4178726.0,4322861.0,4471177.0,4623351.0,...,12697723.0,12894316.0,13115131.0,13350356.0,13586681.0,13814629.0,14030390.0,14236745.0,14439018.0,14645468.0


In [23]:
world_excel = pd.read_excel("../Module 1/datasets/files/WorldPopulation_excel.xlsx", sheet_name='WorldPopulation', parse_dates=True)
world_excel.tail()

Unnamed: 0,Country Name,Country Code,Indicator Name,1960,1961,1962,1963,1964,1965,1966,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
259,Kosovo,XKX,"Population, total",947000.0,966000.0,994000.0,1022000.0,1050000.0,1078000.0,1106000.0,...,1775680.0,1791000.0,1807106.0,1818117.0,1812771.0,1788196.0,1777557.0,1791003.0,1797085.0,1794248.0
260,"Yemen, Rep.",YEM,"Population, total",5315355.0,5393036.0,5473671.0,5556766.0,5641597.0,5727751.0,5816247.0,...,23154855.0,23807588.0,24473178.0,25147109.0,25823485.0,26497889.0,27168210.0,27834821.0,28498687.0,29161922.0
261,South Africa,ZAF,"Population, total",17099840.0,17524533.0,17965725.0,18423161.0,18896307.0,19384841.0,19888250.0,...,51216964.0,52004172.0,52834005.0,53689236.0,54545991.0,55386367.0,56203654.0,57000451.0,57779622.0,58558270.0
262,Zambia,ZMB,"Population, total",3070776.0,3164329.0,3260650.0,3360104.0,3463213.0,3570464.0,3681955.0,...,13605984.0,14023193.0,14465121.0,14926504.0,15399753.0,15879361.0,16363507.0,16853688.0,17351822.0,17861030.0
263,Zimbabwe,ZWE,"Population, total",3776681.0,3905034.0,4039201.0,4178726.0,4322861.0,4471177.0,4623351.0,...,12697723.0,12894316.0,13115131.0,13350356.0,13586681.0,13814629.0,14030390.0,14236745.0,14439018.0,14645468.0


In [26]:
#Running terminal commands in here
!dir

 Volume in drive C is DS_MSURFACEB2
 Volume Serial Number is 386D-B60D

 Directory of C:\Users\monch.mercader\Python\Data_Analytics\Module 1

21/11/2020  03:08 PM    <DIR>          .
21/11/2020  03:08 PM    <DIR>          ..
21/11/2020  03:00 PM    <DIR>          .ipynb_checkpoints
31/10/2020  11:53 AM            12,944 00_CourseIntro.ipynb
21/11/2020  08:42 AM           102,019 01_PythonIntro-1.ipynb
07/11/2020  11:38 AM            23,312 02_ControlFlow.ipynb
10/11/2020  07:59 PM            62,627 03_intermediate_python.ipynb
10/11/2020  09:19 PM            73,045 04_numerical_computing.ipynb
21/11/2020  10:27 AM           180,821 05_pandas.ipynb
21/11/2020  03:00 PM            61,827 06_data_gathering.ipynb
21/11/2020  03:08 PM            84,360 06_data_gathering_monch.ipynb
21/11/2020  08:40 AM            38,227 07_data_cleaning.ipynb
10/11/2020  06:09 PM            12,304 additional_numpy_challenges.ipynb
21/11/2020  08:40 AM           301,193 csv_file.png
21/11/2020  01:00 PM     

# 4. HTML Files

![html](https://media.giphy.com/media/l3vRfNA1p0rvhMSvS/giphy.gif)

> "Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript." ~ [Wikipedia](https://en.wikipedia.org/wiki/HTML)

pandas has, among many things, a function that allows us to read [HTML](https://html.spec.whatwg.org/) tables from a website. This function is `pd.read_html()`. Although it is not a full-fledged web scraping tool such as [Scrapy](https://scrapy.org/) or [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), it is often all you need when the data already sits in an HTML table. These last two libraries are very powerful web scraping tools that you are more than encouraged to explore on your own. Intermediate to complex web scraping requires a fair amount of knowledge of how the structure of a website works, but I have no doubt that with a few hours of focused studying, a couple of projects later, or in the next couple of minutes, you might be well on your way to scraping your own data with 🐼. 😎

Before we explore the pandas function for web scraping, let's quickly define it:

> **Web Scraping** refers to extracting data, structured or unstructured, from websites and making it useful for a variety of purposes, such as marketing analysis. Companies in the marketing arena use web scraping to collect comments about their products. Others, like Google, scrape the entire internet to rank websites given a criterion or search query. While web scraping might be limited in scope to a single website, like what a marketer might do, **web crawling** is the art of crawling over many different and/or nested websites in one go, or repeatedly over time, like what Google does.

We will be scraping the International Foundation for Art Research website using the link below. An important thing to keep in mind is that the pandas function `pd.read_html()` captures whichever tables it can find on the page provided and returns them as a list of dataframes, where each dataframe comes from one table. This means that you first have to assign the list to a variable and then pick out the table or tables you want.

http://www.ifar.org/catalogues_raisonnes.php?alpha=&searchtype=artist&published=1&inPrep=1&artist=&author=

How can we check whether there is a table in a website or not? There are probably plenty of ways to do so, so here are two immediate ones.

1. See if there is a table-like shape in the website that you are interested in. This table would ideally have information in a shape that would fit into a pandas dataframe. For example,
![table](pictures/table.png)
2. The second option is to navigate to the website you are interested in and 

If you have any issues with the `pd.read_html()` function, please check and see if you have the following packages installed and then try again.

- `conda install lxml`
- `pip install beautifulsoup4 html5lib`

In [43]:
data = pd.read_html('http://www.ifar.org/catalogues_raisonnes.php?alpha=&searchtype=artist&published=1&inPrep=1&artist=&author=')

In [44]:
print(type(data), len(data))

<class 'list'> 1


In [46]:
type(data[0])

pandas.core.frame.DataFrame

In [47]:
df_html = data[0]
df_html.head()

Unnamed: 0,0,1,2,3,4
0,Name,Birth Year,Birth Place,Death Year,Death Place
1,"Aachen, Hans von click to learn more",1552,Germany,1615,Czech Republic
2,"Aalto, Alvar click to learn more",1898,Finland,1976,Finland
3,"Abbati, Giuseppe click to learn more",1836,Italy,1868,Italy
4,"Abdessemed, Adel click to learn more",1971,Algeria,,


Notice that the column names are not where they should be. Let's fix that.

We will take the column names from the first row, convert the selection to a regular Python list and then reassign these names to the column names of our dataframe.

In [48]:
# take the first row out and make it a list
col_names = df_html.iloc[0].tolist()
col_names

['Name', 'Birth Year', 'Birth Place', 'Death Year', 'Death Place']

In [49]:
# reassign the names to the column names index
df_html.columns = col_names

# drop the first row of the dataframe
df_html.drop(index=0, axis=0, inplace=True)
df_html.head()

Unnamed: 0,Name,Birth Year,Birth Place,Death Year,Death Place
1,"Aachen, Hans von click to learn more",1552,Germany,1615.0,Czech Republic
2,"Aalto, Alvar click to learn more",1898,Finland,1976.0,Finland
3,"Abbati, Giuseppe click to learn more",1836,Italy,1868.0,Italy
4,"Abdessemed, Adel click to learn more",1971,Algeria,,
5,"Abelenda Zapata, Manuel click to learn more",1889,Spain,1957.0,Spain
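The same fix can be written without `inplace=True`. The sketch below works on a small, made-up dataframe (`raw`) that mimics the shape of the scraped table, and also resets the row index so it starts at 0 again.

```python
import pandas as pd

# Made-up frame whose first row holds the real column names
raw = pd.DataFrame([
    ["Name", "Birth Year", "Birth Place"],
    ["Aalto, Alvar", "1898", "Finland"],
    ["Abbati, Giuseppe", "1836", "Italy"],
])

# Keep everything after the first row and renumber the rows from 0
tidy = raw.iloc[1:].reset_index(drop=True)

# Use the values of the first row as the column names
tidy.columns = raw.iloc[0].tolist()
print(tidy)
```

Building a new dataframe (`tidy`) leaves the original untouched, which makes it easier to rerun a notebook cell without dropping another row by accident.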


In [53]:
# check the type of the first value in the Name column
type(df_html.iloc[0,0])

str

In [60]:
# test on a single string: replace() removes the unwanted trailing text
'Aachen, Hans von click to learn more'.replace(' click to learn more', '')

'Aachen, Hans von'

In [62]:
# apply the same replacement to every value in the Name column
df_html['Name'].str.replace(' click to learn more', '').head()

1           Aachen, Hans von
2               Aalto, Alvar
3           Abbati, Giuseppe
4           Abdessemed, Adel
5    Abelenda Zapata, Manuel
Name: Name, dtype: object

In [63]:
df_html['Name'].str

<pandas.core.strings.StringMethods at 0x249726ca340>

## Exercise 3

Find a table to scrape on the World Wide Web and read it in with pandas.

In [65]:
inara = pd.read_html('https://inara.cz/galaxy-starsystem/11877/')
print(type(inara), len(inara))

type(inara[1])

inara_cz = inara[1]
inara_cz.head()


<class 'list'> 6


Unnamed: 0,Power,Star system,Controlling faction / Factions,Inf,Inf.1,Fac,Dist,Updated
0,,Zeta Trianguli Australis✂︎,Zeta Trianguli Australis CorporationProgressiv...,12.1%,53.2%,7,5.26 Ly,now
1,C,Jiuyou✂︎,Coalition of LFT 1349Jiuyou AllianceProgressiv...,9.2%6.8%4.0%,57.5%,8,6.78 Ly,55 minutes ago
2,,LFT 1349✂︎,Arbor Caelum Internal Defense [Player]L 206-18...,7.2%4.9%,65.6%,7,7.37 Ly,51 minutes ago
3,,LHS 3167✂︎,United Systems Imperium [Player],,43.8%,7,8.03 Ly,5 hours ago
4,,Mari✂︎,People's Mari Revolutionary PartyProgressive P...,13.7%,46.2%,7,8.20 Ly,4 hours ago


## Awesome Work!

You are now ready to start cleaning and preparing datasets for analysis!

![great_work](https://media.giphy.com/media/SWzVtsCPEPggXQsGoT/giphy.gif)