# Part 1: Python Review
## 1.1 Pandas
We first review some basic usage of the pandas DataFrame.

In [None]:
#first import pandas
import pandas as pd
#read data from csv
dataset = pd.read_csv('./world_population.csv', index_col=0)

In [None]:
#check the data
dataset.head()

**Note:**   
`.iloc[]` and `.loc[]` are the two main indexers in pandas. They are used with square brackets rather than called like methods. They allow precise selections of data based on either the integer position (`iloc`) or the index label (`loc`), which in our case is the country name column.
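
The difference between the two indexers can be sketched on a small toy table. The country names and values below are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data: names and numbers are invented for this sketch.
toy = pd.DataFrame(
    {"1990": [10.0, 20.0, 30.0], "2000": [11.0, 22.0, 33.0]},
    index=["Aland", "Borduria", "Cascadia"],
)

# Label-based: select the row whose index label is "Borduria".
by_label = toy.loc["Borduria"]

# Position-based: select the second row (position 1), whatever its label is.
by_position = toy.iloc[1]

# Both selections return the same row here.
print(by_label.equals(by_position))

# Slices behave differently: .loc includes the end label,
# while .iloc excludes the end position.
label_slice = toy.loc["Aland":"Borduria"]
position_slice = toy.iloc[0:1]
print(len(label_slice), len(position_slice))
```

Passing a list, such as `toy.loc[["Borduria"]]`, returns a DataFrame instead of a Series, which is why the cells in this section use double brackets.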

In [None]:
#locate the USA row
dataset.loc[["United States"]].head()

In [None]:
#locate rows Germany, Singapore, United States, and India 
dataset.loc[["Germany", "Singapore", "United States", "India"]]

In [None]:
#locate the second-to-last row by position
dataset.iloc[[-2]]

In [None]:
#get the column of 2000
dataset["2000"].head()

In [None]:
#locate the countries in rows 2 to 5 (positions 1 to 4; the end of an iloc slice is excluded)
dataset.iloc[1:5]

In [None]:
#subset of Germany, Singapore, United States, and India 
#for years 1970, 1990, 2010
country_list = ["Germany", "Singapore", "United States", "India"]
dataset.loc[country_list][["1970", "1990", "2010"]]

In [None]:
#calculate the mean of the third row
dataset.iloc[[2]].mean(axis=1)

In [None]:
#calculate the mean of the last row
dataset.iloc[[-1]].mean(axis=1)

In [None]:
#calculate the mean of the country Germany
dataset.loc[["Germany"]].mean(axis=1)

In [None]:
# filter columns 1961, 2000, and 2015
dataset.filter(items=["1961", "2000", "2015"]).head()

In [None]:
# filter countries that had a population density greater than 500 in 2000
dataset[(dataset["2000"] > 500)][["2000"]]

# Part 2: Advanced Python
## 2.1 JSON




### Load JSON File in Python
Before loading any files, first extract the structured-2018-01-14-neworleans.tar.gz archive. The JSON files are then available under the structured-2018-01-14-neworleans folder.

In [None]:
import os
#build the path to the data directory here
datadir = os.path.join('./', 'structured-2018-01-14-neworleans')

#let's load one of the json files
jsonfile = 'structured-1515984523-6592b573-b485-58b0-963e-6be0b4d02f6c.json'

#create the full file path
jsonpath = os.path.join(datadir, jsonfile)
print(jsonpath)

We can open the file and read the raw data:

In [None]:
# open the file
with open(jsonpath, 'r') as f:
    rawdata = f.read()

In [None]:
type(rawdata)

In [None]:
#check the rawdata
rawdata

### Make the JSON file readable

We have read the raw text in as a string, but we want to "unpack" it into a readable, structured format. This step is also called "deserialization".

Let's start with the standard json library, followed by ujson, a faster library for JSON.
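
As a minimal sketch of what deserialization does, the example below parses a short, made-up JSON string (not the match file) and then serializes it back to text. It shows how JSON types map to Python types.

```python
import json

# Hypothetical JSON text, standing in for the contents of a file.
raw = '{"title": "demo match", "rounds": [1, 2, 3], "finished": true, "winner": null}'

# Deserialization: JSON text -> Python objects.
parsed = json.loads(raw)
print(type(parsed))        # a JSON object becomes a dict
print(parsed["rounds"])    # a JSON array becomes a list
print(parsed["finished"])  # JSON true becomes Python True
print(parsed["winner"])    # JSON null becomes Python None

# Serialization: Python objects -> JSON text (indent makes it easier to read).
text = json.dumps(parsed, indent=2)
print(text)
```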

In [None]:
import json
data = json.loads(rawdata)

In [None]:
type(data)

In [None]:
data

The data is essentially a nested dictionary. Let's compare it with a simple dictionary first.

In [None]:
simpledict = {"a": 1, "b": 2}
simpledict.keys()

Meanwhile, the values can themselves be dicts and lists:

In [None]:
nesteddict = {"a": [5, 6, 7], "b": {"dogs": 10, "cats": 11}}
nesteddict.keys()
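
Values inside a nested dictionary are reached by chaining keys and list positions. The sketch below reuses the same shape as `nesteddict`; the `"birds"` key is a hypothetical missing key, used only to show `.get()`.

```python
# Same shape as the nested dictionary above.
nested = {"a": [5, 6, 7], "b": {"dogs": 10, "cats": 11}}

# A key followed by a list position.
second_item = nested["a"][1]

# A key followed by another key.
dog_count = nested["b"]["dogs"]

# .get() returns a default value instead of raising KeyError
# when the key does not exist.
bird_count = nested["b"].get("birds", 0)

print(second_item, dog_count, bird_count)
```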

In [None]:
data.keys()

In [None]:
data['title']

In [None]:
data['platform']

In [None]:
#timestamps are often stored as "unix timestamps"
#it is the number of seconds elapsed since Jan 1, 1970
data['start_time_s']

In [None]:
data['end_time_s']

In [None]:
data['end_time_s'] - data['start_time_s']

In [None]:
#confirm the duration (note the unit: milliseconds, not seconds)
data['duration_ms']
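
Unix timestamps are easier to read once converted to dates. The following is a minimal sketch using the standard `datetime` module; the three values are hypothetical stand-ins for `start_time_s`, `end_time_s`, and `duration_ms`, not the values stored in the match file.

```python
from datetime import datetime, timezone

# Hypothetical values, chosen only for illustration.
start_time_s = 1515984523
end_time_s = 1515985123
duration_ms = 600000

# Convert seconds since Jan 1, 1970 (UTC) into timezone-aware datetimes.
start = datetime.fromtimestamp(start_time_s, tz=timezone.utc)
end = datetime.fromtimestamp(end_time_s, tz=timezone.utc)
print(start.isoformat())
print(end.isoformat())

# The difference is in seconds, so convert to milliseconds before comparing.
elapsed_s = end_time_s - start_time_s
print(elapsed_s * 1000 == duration_ms)
```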

In [None]:
data['map']

In [None]:
data['rounds']

In [None]:
data['teams']

In [None]:
data['teams'][0]

In [None]:
data['teams'][0]['name']

In [None]:
data['players']

In [None]:
data['players'][0]

In [None]:
#if you don't have ujson, you can install it
!pip install ujson

In [None]:
#let's try ujson, which has a similar interface but is faster for larger data
import ujson
data = ujson.loads(rawdata)

In [None]:
type(data)

In [None]:
data

In [None]:
data.keys()

In [None]:
data['title']

In [None]:
#write json object to disk
with open('./match.json', 'w') as f:
    ujson.dump(data, f)
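
To check that a file written this way can be read back, a round trip can be sketched with the standard `json` module. The record and the temporary directory below are assumptions made for this example; the sketch does not touch `./match.json`.

```python
import json
import os
import tempfile

# Hypothetical record, standing in for the match data.
record = {"title": "demo match", "teams": [{"name": "red"}, {"name": "blue"}]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "match.json")

    # dump() writes a Python object to an open file as JSON text.
    with open(path, "w") as f:
        json.dump(record, f)

    # load() reads JSON text from an open file back into Python objects.
    with open(path, "r") as f:
        restored = json.load(f)

# The restored object equals the original one.
print(restored == record)
```

Note the naming pattern: `dump`/`load` work with file objects, while `dumps`/`loads` work with strings.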

## 2.2 Lambda Function
Here we start with some simple artificial examples to learn the basics of lambda (anonymous) functions.
We then use another dataset with pandas to see how to apply lambda functions in data analysis.

A lambda function in Python has the form:
```lambda argument_list: expression```

### Simple Lambda Functions

In [None]:
#a simple example
#define a function sq
def sq(x):
    return x*x

In [None]:
#map(function, iterable)
#map applies the function to every element of the iterable
list(map(sq, [y for y in range(10)]))

In [None]:
#the same process can be done by lambda function lambda x: x*x
list(map(lambda x:x*x, [y for y in range(10)]))

In [None]:
#you can assign the anonymous function to a variable
c=lambda x,y,z:x*y*z
c(2,3,4)

In [None]:
#it can even be called directly
(lambda x:x**2)(3)

In [None]:
#filter data with lambda function
list(filter(lambda x: x%3==0, [1,2,3,4,5,6]))

In [None]:
#filter data with lambda function
Names = ['Anne', 'Amy', 'Bob', 'David', 'Carrie', 'Barbara', 'Zach']

#filter those names starting with B
B_Name= list(filter(lambda x: x.startswith('B'), Names))
print(B_Name)

In [None]:
#you will use map() in spark, here is another example
squares = map(lambda x:x**2, range(5))
list(squares)

In [None]:
#use lambda with reduce() function
from functools import reduce
print(reduce(lambda a,b:'{},{}'.format(a,b), [1,2,3,4,5,6,7,8,9]))

In [None]:
print(reduce(lambda a,b:a+b, [1,2,3,4,5,6,7,8,9]))
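
How `reduce()` folds a list from left to right can be made visible with `itertools.accumulate`, which yields every intermediate result. This is a small sketch on made-up numbers.

```python
from functools import reduce
from itertools import accumulate

numbers = [1, 2, 3, 4, 5]

# reduce keeps only the final result: ((((1+2)+3)+4)+5)
total = reduce(lambda a, b: a + b, numbers)

# accumulate applies the same function but yields each intermediate value
running = list(accumulate(numbers, lambda a, b: a + b))

# The optional third argument of reduce is an initial value;
# it is also what reduce returns for an empty input.
empty_total = reduce(lambda a, b: a + b, [], 0)

print(total, running, empty_total)
```

Without the initial value, calling `reduce()` on an empty list raises a `TypeError`.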

In [None]:
#sort data
a=[('b',3),('a',2),('d',4),('c',1)]

#sort by the key
sorted(a, key=lambda x:x[0])

In [None]:
#sort by the value
sorted(a, key=lambda x:x[1])

In [None]:
#add two lists together
a = [1,2,3,4]
b = [5,6,7,8]
print(list(map(lambda x,y:x+y, a, b)))

In [None]:
sentence = "Welcome To University of Colorado Boulder!"
words = sentence.split()
lengths  = map(lambda x:len(x), words)
print(list(lengths))

### Use Lambda Function in Pandas

In [None]:
#load a dataset to apply lambda function
dataset = pd.read_csv('./olympia2016_athletes.csv')
#check the data
dataset.head()

In [None]:
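#let's first obtain the two-digit year of birth for each player
#we can define a function
def dob_trans(player):
    #the date of birth is in the fifth column (position 4)
    #use .iloc: plain integer keys on a labeled row are deprecated in recent pandas
    dob = str(player.iloc[4])
    yr = dob[-2:]
    return yr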

dataset['yr_1'] = dataset.apply(dob_trans, axis = 1)
dataset.head()

In [None]:
#do the same with a lambda function (.iloc selects the column by position)
dataset['yr_2'] = dataset.apply(lambda x:str(x.iloc[4])[-2:], axis = 1)
dataset.head()

In [None]:
dataset['sex'].unique()

In [None]:
#now we want to transform the gender column to a single character
#we can define a function
def gender_trans(player):
    gender = player.iloc[3]
    if gender == "female":
        return "F"
    elif gender == "male":
        return "M"
    else:
        return ""

dataset['gender_trans_1'] = dataset.apply(gender_trans, axis = 1)
dataset.head()

**Note:**  
Writing if/elif/else logic inside a lambda function is awkward to read, but sometimes it saves the effort of writing a separate function. Here is the code structure for if/else and for if/elif/else in a lambda:

```lambda <arguments>: <return value if condition is True> if <condition> else <return value if condition is False>```

```lambda <arguments>: <return value> if <condition> else (<return value> if <condition> else <return value>)```
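
As a minimal sketch of the second pattern, the example below labels scores with a nested conditional expression. The thresholds and label names are hypothetical and chosen only for illustration.

```python
# Hypothetical grading rule: three outcomes need one nested conditional.
grade = lambda score: "high" if score >= 80 else ("medium" if score >= 50 else "low")

# The equivalent named function, for comparison.
def grade_def(score):
    if score >= 80:
        return "high"
    elif score >= 50:
        return "medium"
    else:
        return "low"

scores = [95, 60, 20]
labels = [grade(s) for s in scores]
labels_def = [grade_def(s) for s in scores]
print(labels)
print(labels == labels_def)
```

With more than two or three branches, the named function is usually easier to read and test.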

In [None]:
#do the same with a lambda function (.iloc selects the column by position)
dataset['gender_trans_2'] = dataset.apply(lambda x: "F" if x.iloc[3]=="female" else ("M" if x.iloc[3]=="male" else ""), axis = 1)
dataset.head()
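
When the transformation is a plain lookup, an alternative to a row-wise `apply` is `Series.map` with a dictionary. The sketch below uses a tiny made-up table with a `sex` column rather than the athletes file, and the name `sex_code` is an assumption for this example.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
athletes = pd.DataFrame({"sex": ["female", "male", "female", None]})

# Lookup table from the full word to a single character.
codes = {"female": "F", "male": "M"}

# Values that are not in the dictionary become missing,
# so fill them with an empty string to mirror the else branch.
athletes["sex_code"] = athletes["sex"].map(codes).fillna("")
print(athletes["sex_code"].tolist())
```

This works on one column at a time, so it avoids building a full row object for every record.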

## 2.3 Parquet File
Here we use pandas to save and load Parquet files. Later on we will also use Spark to save and load Parquet files; the logic is very similar.

In [None]:
#pandas needs a parquet engine such as pyarrow to read and write parquet files
#for convenience, you can run pip directly in the notebook
!pip install pyarrow

In [None]:
import pandas as pd
#read a csv data and write to parquet
df = pd.read_csv('./olympia2016_athletes.csv')
df.to_parquet('./olympia2016_athletes.parquet')
#the parquet file now appears on your disk; it is typically smaller than the csv

In [None]:
#read data from parquet format
df = pd.read_parquet('./olympia2016_athletes.parquet')
df.head()