### <span style="color:black"><b>Pandas Tutorial 1</b></span>

---

<u>Reading and writing files</u>

* To analyse a dataset, we first need to be able to load it.
* In this video, we will read in data from a variety of sources and save it to files for further use.

Useful top-level pandas functions:
<pre>
pd.read_csv()
pd.read_html()
pd.read_json()
</pre>

DataFrame methods:
<pre>
data.head()
data.tail()
data.sample()
data.to_csv()
</pre>




In [1]:
# pandas must be imported before any of the code below will run
import pandas as pd

In [2]:
# Read in the data, storing the result in a variable called tesla_data
tesla_data = pd.read_csv('tesla.csv')

In [3]:
# Look at the first 5 rows
tesla_data.head()

,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-08-19,373.0,382.200012,368.242004,375.705994,375.705994,61026500
1,2020-08-20,372.135986,404.39801,371.411987,400.365997,400.365997,103059000
2,2020-08-21,408.951996,419.097992,405.01001,409.996002,409.996002,107448000
3,2020-08-24,425.256012,425.799988,385.503998,402.839996,402.839996,100318000
4,2020-08-25,394.977997,405.589996,393.600006,404.667999,404.667999,53294500


In [4]:
# Look at the last 10 rows
tesla_data.tail(10)

,Date,Open,High,Low,Close,Adj Close,Volume
243,2021-08-06,711.900024,716.330017,697.630005,699.099976,699.099976,15576200
244,2021-08-09,710.169983,719.030029,705.130005,713.76001,713.76001,14715300
245,2021-08-10,713.98999,716.590027,701.880005,709.98999,709.98999,13432300
246,2021-08-11,712.710022,715.179993,704.210022,707.820007,707.820007,9800600
247,2021-08-12,706.340027,722.799988,699.400024,722.25,722.25,17459100
248,2021-08-13,723.710022,729.900024,714.340027,717.169983,717.169983,16698900
249,2021-08-16,705.070007,709.5,676.400024,686.169983,686.169983,22677400
250,2021-08-17,672.659973,674.580017,648.840027,665.710022,665.710022,23721300
251,2021-08-18,669.75,695.77002,669.349976,688.98999,688.98999,20220400
252,2021-08-19,669.747925,686.549988,678.02002,681.940002,681.940002,1074252


In [5]:
# Look at 3 random rows
tesla_data.sample(3)

,Date,Open,High,Low,Close,Adj Close,Volume
44,2020-10-21,422.700012,432.950012,421.25,422.640015,422.640015,32370500
206,2021-06-15,616.690002,616.789978,598.22998,599.359985,599.359985,17764100
214,2021-06-25,689.580017,693.809998,668.700012,671.869995,671.869995,32496700
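
The three inspection methods used so far differ only in which rows they return. Here is a minimal sketch on a small made-up frame (the `prices` data is invented for illustration and is not taken from `tesla.csv`). Passing `random_state` to `sample()` is optional and simply makes the random draw repeatable.

```python
import pandas as pd

# Hypothetical data: ten rows with made-up closing prices
prices = pd.DataFrame({
    "Day": range(1, 11),
    "Close": [10.0, 10.5, 10.2, 10.8, 11.1, 10.9, 11.4, 11.6, 11.3, 11.9],
})

first_rows = prices.head()                       # first 5 rows by default
last_rows = prices.tail(3)                       # last 3 rows
random_rows = prices.sample(3, random_state=0)   # 3 random rows, repeatable

print(first_rows)
print(last_rows)
print(random_rows)
```

Without `random_state`, each call to `sample()` will usually return a different set of rows.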


In [6]:
# Read in the samsung file
samsung = pd.read_csv('samsung.csv')

In [7]:
# Look at the first 5 rows
samsung.head(5)

,Noah Rubin
0,This is tutorial 1. Here we will read in some ...
1,Pandas video
2,My file: File content below...
3,/Date/Open/Close
4,0/2020-08-19/51000.0/50100.0


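So we have three main issues here:

1. The first four lines of the file (the `Noah Rubin` line and the three lines after it) are not useful to us
2. The columns are not displayed properly, because the values are separated by slashes rather than commas
3. Each data row starts with a meaningless column of row numbers (the `0` before the first slash)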

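It turns out that passing a few extra arguments to `pd.read_csv()` takes care of all three: `skiprows=4` skips the unwanted lines at the top, `sep='/'` splits the values on the slash character, and `usecols` keeps only the named columns.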

In [8]:
# This is what we'd need to assign to samsung instead
pd.read_csv('samsung.csv', skiprows=4, sep='/', usecols=['Date', 'Open', 'Close'])

,Date,Open,Close
0,2020-08-19,51000.0,50100.0
1,2020-08-20,50000.0,48250.0
2,2020-08-21,48500.0,48650.0
3,2020-08-24,48700.0,48600.0
4,2020-08-25,49000.0,48750.0
...,...,...,...
244,2021-08-12,72300.0,71800.0
245,2021-08-13,70700.0,69600.0
246,2021-08-17,69200.0,69100.0
247,2021-08-18,68900.0,69500.0
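
To experiment with these arguments without the original file, the same call can be run against text held in memory. This is only a sketch: `raw_text` is a shortened, hypothetical stand-in modelled on the preview of `samsung.csv` above, and `io.StringIO` takes the place of the file path.

```python
import io
import pandas as pd

# Hypothetical file contents, modelled on the preview of samsung.csv
raw_text = """Noah Rubin
This is tutorial 1.
Pandas video
My file: File content below...
/Date/Open/Close
0/2020-08-19/51000.0/50100.0
1/2020-08-20/50000.0/48250.0
"""

samsung_demo = pd.read_csv(
    io.StringIO(raw_text),              # stands in for a path such as 'samsung.csv'
    skiprows=4,                         # skip the four lines before the real header
    sep="/",                            # values are separated by slashes, not commas
    usecols=["Date", "Open", "Close"],  # drop the unnamed leading column
)
print(samsung_demo)
```

Removing one argument at a time and re-running is a quick way to see which of the three issues each argument fixes.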


<u>Reading in data from a website</u>

What if the data we'd like to obtain is not in a CSV file, but is instead located on a web page? For tasks like this we can use the top-level `pd.read_html()` function. It is designed for HTML tables and saves us from having to write more complicated scraping scripts with tools such as Beautiful Soup or Selenium. It returns a list of DataFrames, one for each table it finds, so we index into the list to pick the table we want.

In [9]:
# Read in the ASX ETF data (the first table on the page)
stock_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_Australian_exchange-traded_funds')[0]
stock_data.head()  # Great

,ASX Code,Issuer,Name,Benchmark,Domicile,MER%
0,A200,BetaShares,BetaShares Australia 200 ETF,Solactive Australia 200 Index,AUS,0.07
1,NDQ,BetaShares,BetaShares NASDAQ 100,NASDAQ 100,AUS,0.48
2,HNDQ,BetaShares,BetaShares NASDAQ 100 - Currency Hedged,NASDAQ 100 AUD Hedged,AUS,0.51
3,ATEC,BetaShares,BetaShares S&P/ASX Australian Technology ETF,S&P/ASX All Technology Index,AUS,0.48
4,CLDD,BetaShares,BetaShares Cloud Computing ETF,Indxx Global Cloud Computing Index,AUS,0.67
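
`pd.read_html()` can also be tried offline by giving it HTML text instead of a URL. The sketch below uses a hypothetical two-table page (the `html` string is invented, with two rows borrowed from the output above). It also guards against a missing parser, since `read_html` relies on an optional dependency such as `lxml`.

```python
import io
import pandas as pd

# Hypothetical page with two tables; only the second holds the data we want
html = """
<table><tr><th>Note</th></tr><tr><td>Navigation box</td></tr></table>
<table>
  <tr><th>ASX Code</th><th>MER%</th></tr>
  <tr><td>A200</td><td>0.07</td></tr>
  <tr><td>NDQ</td><td>0.48</td></tr>
</table>
"""

try:
    tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table>
except ImportError:
    tables = []  # an optional parser dependency (e.g. lxml) is not installed

print(len(tables))
if tables:
    etfs = tables[1]  # pick the table by its position on the page
    print(etfs)
```

On a real page it is worth printing `len(tables)` first, because the table of interest is not always the first one.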


In [10]:
# Save the DataFrame to a CSV file, leaving out the index
stock_data.to_csv('my_stock_data.csv', index=False)
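
The `index=False` argument matters more than it might seem. Here is a small sketch with a hypothetical two-row frame (`funds` is made up and stands in for `stock_data`). Calling `to_csv()` without a path returns the CSV text as a string, which makes the difference easy to see.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for stock_data
funds = pd.DataFrame({"ASX Code": ["A200", "NDQ"], "MER%": [0.07, 0.48]})

with_index = funds.to_csv()                 # default: index written as the first column
without_index = funds.to_csv(index=False)   # only the real columns are written

print(with_index)
print(without_index)

# Reading the first version back produces an extra 'Unnamed: 0' column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

A file saved with the default settings therefore picks up a meaningless leading column when it is read back in.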

In [11]:
# Load in the teams.json file
teams = pd.read_json('teams.json')
teams.head()

,Year,First place,Second place,Third place
0,1993,Germany,Italy,Brazil
1,1994,Brazil,Spain,Sweden
2,1995,Brazil,Germany,Italy
3,1996,Brazil,Germany,France
4,1997,Brazil,Germany,Czech Republic
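
The layout of `teams.json` is not shown here, so the following is only a sketch of one layout that `pd.read_json()` accepts: a list of records, with one JSON object per row. The `json_text` string is hypothetical and reuses two rows from the output above.

```python
import io
import pandas as pd

# Hypothetical JSON text: a list of records, one object per row
json_text = """[
  {"Year": 1993, "First place": "Germany", "Second place": "Italy"},
  {"Year": 1994, "First place": "Brazil", "Second place": "Spain"}
]"""

teams_demo = pd.read_json(io.StringIO(json_text))
print(teams_demo)
```

Each key becomes a column and each object becomes a row.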


Additional content:

* [Read in Excel files](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
* [Linking to a database and reading in data](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)