# Tutorial 6: Data cleaning and wrangling with Pandas
This tutorial covers practice questions on additional Pandas functions and methods for further processing of datasets, complementing Lecture 7.

### 6.1 Groupby method in Pandas
- Works similarly to the `GROUP BY` statement in SQL.
- You may pass one or more column names to the `DataFrame.groupby()` method; rows that share the same values in those columns form a group.
- It returns a `GroupBy` object, which organizes the data according to those groups. Calling an aggregation or statistical method on the `GroupBy` object returns the aggregated data evaluated from the rows in each group.
- Common methods for `GroupBy` objects include `.max()`, `.min()`, `.mean()`, `.std()`, `.first()`, `.last()` and `.sum()`.
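
The points above can be sketched on a small made-up table. The `sales` DataFrame and its column names are illustrative assumptions, not part of the course data. The sketch groups by one column, then by two columns, and finally applies several aggregations at once with `.agg()`.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "year": [2022, 2023, 2022, 2022, 2023],
    "revenue": [10.0, 12.0, 7.0, 9.0, 11.0],
})

# One grouping column: total revenue per region
total_by_region = sales.groupby("region")["revenue"].sum()

# Two grouping columns: mean revenue per (region, year) pair
mean_by_region_year = sales.groupby(["region", "year"])["revenue"].mean()

# Several aggregations at once with .agg()
summary = sales.groupby("region")["revenue"].agg(["min", "max", "mean"])

print(total_by_region)
print(mean_by_region_year)
print(summary)
```

With two grouping columns, the result is indexed by both columns, so a single value can be looked up with a tuple such as `mean_by_region_year[("West", 2022)]`.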

In the following code, we import the `macrodata.csv` file from Dr. Huang's data files; the Dropbox download link is given in the Lecture 5 notes.

In [1]:
import pandas as pd
# The directory path copied from Windows File Explorer is preceded by r, which makes it a raw string:
# each backslash in the path is kept as a literal character instead of starting an escape sequence (such as \t for a tab).
# On Mac or Linux, paths use forward slashes, so the r prefix is not needed (though it does no harm).
macro_file_path = r"D:\OneDrive - The University Of Hong Kong\HKU TA\Fall 2023-2024\FINA2390\data_to_share\macrodata.csv" # r prefix is optional on Mac/ Linux
macro_df = pd.read_csv(macro_file_path)
print(macro_df)

       year  quarter    realgdp  realcons   realinv  realgovt  realdpi  \

0    1959.0      1.0   2710.349    1707.4   286.898   470.045   1886.9   

1    1959.0      2.0   2778.801    1733.7   310.859   481.301   1919.7   

2    1959.0      3.0   2775.488    1751.8   289.226   491.260   1916.4   

3    1959.0      4.0   2785.204    1753.7   299.356   484.052   1931.3   

4    1960.0      1.0   2847.699    1770.5   331.722   462.199   1955.5   

..      ...      ...        ...       ...       ...       ...      ...   

198  2008.0      3.0  13324.600    9267.7  1990.693   991.551   9838.3   

199  2008.0      4.0  13141.920    9195.3  1857.661  1007.273   9920.4   

200  2009.0      1.0  12925.410    9209.2  1558.494   996.287   9926.4   

201  2009.0      2.0  12901.504    9189.0  1456.678  1023.528  10077.5   

202  2009.0      3.0  12990.341    9256.0  1486.398  1044.088  10040.6   



         cpi      m1  tbilrate  unemp      pop  infl  realint  

0     28.980   139.7      2.82   

We may use the `groupby` method to evaluate the real GDP in each year: the four quarterly values in the `realgdp` column
for each year are added up into the total real GDP for that year.

In [2]:
annual_realgdp = macro_df.groupby("year")["realgdp"].sum()
print(annual_realgdp)

year

1959.0    11049.842

1960.0    11323.727

1961.0    11587.518

1962.0    12289.560

1963.0    12826.833

1964.0    13569.259

1965.0    14440.510

1966.0    15381.368

1967.0    15770.092

1968.0    16533.571

1969.0    17047.199

1970.0    17079.758

1971.0    17653.052

1972.0    18590.919

1973.0    19668.039

1974.0    19559.665

1975.0    19518.076

1976.0    20565.181

1977.0    21510.607

1978.0    22710.496

1979.0    23420.197

1980.0    23355.917

1981.0    23948.758

1982.0    23483.778

1983.0    24544.681

1984.0    26308.465

1985.0    27397.060

1986.0    28346.035

1987.0    29253.109

1988.0    30455.554

1989.0    31543.709

1990.0    32135.631

1991.0    32060.570

1992.0    33148.287

1993.0    34093.797

1994.0    35482.691

1995.0    36374.894

1996.0    37735.577

1997.0    39417.332

1998.0    41134.064

1999.0    43119.396

2000.0    44903.909

2001.0    45388.625

2002.0    46211.892

2003.0    47362.803

2004.0    49055.256

2005.0    50553.500

2006.0 

The following example evaluates the year-end value (the last observation in each year) of the `cpi`, `m1` and `tbilrate` columns:

In [3]:
annual_freq_data1 = macro_df.groupby("year")[["cpi", "m1", "tbilrate"]].last()
print(annual_freq_data1)

            cpi      m1  tbilrate

year                             

1959.0   29.370   140.0      4.33

1960.0   29.840   141.1      2.29

1961.0   30.040   145.2      2.60

1962.0   30.440   148.3      2.87

1963.0   30.940   153.7      3.52

1964.0   31.280   160.7      3.76

1965.0   31.880   169.1      4.35

1966.0   32.900   171.9      5.00

1967.0   34.100   184.3      4.90

1968.0   35.700   198.7      5.85

1969.0   37.900   206.2      7.64

1970.0   39.900   215.5      4.86

1971.0   41.200   230.1      3.87

1972.0   42.700   251.5      5.09

1973.0   46.800   263.8      7.68

1974.0   52.300   273.9      6.96

1975.0   55.800   288.4      5.26

1976.0   58.700   308.3      4.57

1977.0   62.700   334.4      6.20

1978.0   68.500   358.6      9.02

1979.0   78.000   385.8     11.94

1980.0   87.200   411.3     14.75

1981.0   94.400   442.7     11.33

1982.0   97.900   477.2      7.96

1983.0  102.100   525.1      8.89

1984.0  105.700   557.0      8.14

1985.0  109.900   62

### 6.2 drop_duplicates usage

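We may use the `.drop_duplicates()` method of a DataFrame to remove rows whose values are duplicated in certain columns. The `subset` argument names those columns, and `keep` chooses which of the duplicated rows to retain.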

In [4]:
#Keep the last row for each year from macro_df DataFrame
macro_df_last_quarter = macro_df.drop_duplicates(subset="year", keep="last")

In [5]:
print(macro_df_last_quarter)

       year  quarter    realgdp  realcons   realinv  realgovt  realdpi  \

3    1959.0      4.0   2785.204    1753.7   299.356   484.052   1931.3   

7    1960.0      4.0   2802.616    1788.2   259.764   476.434   1966.6   

11   1961.0      4.0   2977.830    1859.6   315.463   502.521   2082.0   

15   1962.0      4.0   3100.563    1945.1   325.650   535.912   2154.6   

19   1963.0      4.0   3264.967    2020.6   364.534   532.383   2254.6   

23   1964.0      4.0   3431.957    2141.2   389.910   514.603   2420.4   

27   1965.0      4.0   3724.014    2314.3   446.493   544.121   2594.1   

31   1966.0      4.0   3884.520    2391.4   472.957   599.528   2688.2   

35   1967.0      4.0   3980.970    2465.7   462.834   640.234   2797.4   

39   1968.0      4.0   4178.293    2623.5   480.998   636.729   2918.4   

43   1969.0      4.0   4263.261    2704.1   492.334   606.900   3034.9   

47   1970.0      4.0   4256.637    2749.6   458.406   564.666   3135.1   

51   1971.0      4.0   44

In [6]:
#Keep the first row for each year from macro_df DataFrame
macro_df_first_quarter = macro_df.drop_duplicates(subset="year", keep="first")
print(macro_df_first_quarter)

       year  quarter    realgdp  realcons   realinv  realgovt  realdpi  \

0    1959.0      1.0   2710.349    1707.4   286.898   470.045   1886.9   

4    1960.0      1.0   2847.699    1770.5   331.722   462.199   1955.5   

8    1961.0      1.0   2819.264    1787.7   266.405   475.854   1984.5   

12   1962.0      1.0   3031.241    1879.4   334.271   520.960   2101.7   

16   1963.0      1.0   3141.087    1958.2   343.721   522.917   2172.5   

20   1964.0      1.0   3338.246    2060.5   379.523   529.686   2299.6   

24   1965.0      1.0   3516.251    2188.8   429.145   508.006   2447.4   

28   1966.0      1.0   3815.423    2348.5   484.244   556.593   2618.4   

32   1967.0      1.0   3918.740    2405.3   460.007   640.682   2728.4   

36   1968.0      1.0   4063.013    2524.6   472.907   651.378   2846.2   

40   1969.0      1.0   4244.100    2652.9   512.686   633.224   2923.4   

44   1970.0      1.0   4256.573    2720.7   476.925   594.888   3050.1   

48   1971.0      1.0   43
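
Keeping the last row per year with `.drop_duplicates()` looks similar to `.groupby("year").last()` from Section 6.1, but the two can differ when values are missing. The sketch below uses a made-up `toy` DataFrame (an assumption for illustration, not the macro data) to show the difference: `drop_duplicates` keeps the whole last row, while `GroupBy.last()` returns the last non-null value in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last quarter of 2001 has a missing rate
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "quarter": [3, 4, 3, 4],
    "rate": [1.5, 1.7, 2.0, np.nan],
})

# drop_duplicates keeps the entire last row for each year, NaN included
last_rows = toy.drop_duplicates(subset="year", keep="last")

# GroupBy.last() takes the last non-null value column by column
last_values = toy.groupby("year").last()

print(last_rows)
print(last_values)
```

For 2001, `last_rows` reports a missing `rate`, whereas `last_values` pairs quarter 4 with the rate observed in quarter 3. Which behavior is preferable depends on the question being asked.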

### 6.3 Combining datasets with merge function

The `pd.merge()` function combines two DataFrames that hold different variables (columns) for the same set of observations, matched through a shared identifier (key).

In [7]:
# create dataframes from the dictionaries
data1 = {
    'EmployeeID' : ['E001', 'E002', 'E003', 'E004', 'E005'],
    'Name' : ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
    'DeptID': ['D001', 'D003', 'D001', 'D002', 'D003'],
}
employees = pd.DataFrame(data1)

data2 = {
    'DeptID': ['D001', 'D002', 'D003'],
    'DeptName': ['Sales', 'HR', 'Admin']
}
departments = pd.DataFrame(data2)

# merge dataframes employees and departments
merged_df = pd.merge(employees, departments)
print(employees)
print(departments)
print(merged_df)

  EmployeeID         Name DeptID

0       E001     John Doe   D001

1       E002   Jane Smith   D003

2       E003  Peter Brown   D001

3       E004  Tom Johnson   D002

4       E005   Rita Patel   D003

  DeptID DeptName

0   D001    Sales

1   D002       HR

2   D003    Admin

  EmployeeID         Name DeptID DeptName

0       E001     John Doe   D001    Sales

1       E003  Peter Brown   D001    Sales

2       E002   Jane Smith   D003    Admin

3       E005   Rita Patel   D003    Admin

4       E004  Tom Johnson   D002       HR


- Note that both input DataFrames have `DeptID` as the common column. \
`merged_df` is formed by combining rows from the input DataFrames that share the same `DeptID` value. \
Under the default behavior `how = "inner"`, a row appears in the merged DataFrame only if its key appears in BOTH input DataFrames.
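
To see what the inner join leaves out, the following sketch compares `how="inner"` with `how="outer"` on made-up tables in which the keys do not fully match. The `staff` and `depts` DataFrames are illustrative assumptions; `indicator=True` is a `pd.merge()` option that adds a `_merge` column showing which input each row came from.

```python
import pandas as pd

# Hypothetical toy data: E003 points to a department that is not listed,
# and D003 has no employees
staff = pd.DataFrame({
    "EmployeeID": ["E001", "E002", "E003"],
    "DeptID": ["D001", "D002", "D009"],
})
depts = pd.DataFrame({
    "DeptID": ["D001", "D002", "D003"],
    "DeptName": ["Sales", "HR", "Admin"],
})

# Inner join: only keys present in both inputs survive
inner = pd.merge(staff, depts, on="DeptID", how="inner")

# Outer join: every key from either input is kept, gaps become NaN
outer = pd.merge(staff, depts, on="DeptID", how="outer", indicator=True)

print(inner)
print(outer)
```

The outer result has one row per key found in either table, and its `_merge` column takes the values `both`, `left_only` or `right_only`.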

In [10]:
merged_df2 = pd.merge(employees, departments, on="DeptID", how="inner")
merged_df2.equals(merged_df) #DeptID column serves as the key for merge function

True

- The following example shows the merge application where the key column name is different in the two input DataFrames.

In [12]:
import numpy as np
np.random.seed(5)
left = pd.DataFrame({'key_left': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key_right': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
print(left)
print(right)

  key_left     value

0        A  0.441227

1        B -0.330870

2        C  2.430771

3        D -0.252092

  key_right     value

0         B  0.109610

1         D  1.582481

2         E -0.909232

3         F -0.591637


In [14]:
merged_df = pd.merge(left, right, left_on='key_left', right_on='key_right')
print(merged_df)

  key_left   value_x key_right   value_y

0        B -0.330870         B  0.109610

1        D -0.252092         D  1.582481


In [16]:
merged_df = pd.merge(left, right, left_on='key_left', right_on='key_right', how="left")
print(merged_df)

  key_left   value_x key_right   value_y

0        A  0.441227       NaN       NaN

1        B -0.330870         B  0.109610

2        C  2.430771       NaN       NaN

3        D -0.252092         D  1.582481
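
The `value_x` and `value_y` labels in the outputs above arise because both inputs contain a non-key column named `value`; by default pandas appends the suffixes `_x` and `_y` to tell them apart. As a minimal sketch, the `suffixes` argument can supply more descriptive labels. The `left_toy` and `right_toy` DataFrames are made-up stand-ins with fixed numbers instead of random draws.

```python
import pandas as pd

# Hypothetical toy data with fixed numbers instead of random draws
left_toy = pd.DataFrame({"key_left": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
right_toy = pd.DataFrame({"key_right": ["B", "D", "E", "F"], "value": [20, 40, 50, 60]})

labelled = pd.merge(
    left_toy, right_toy,
    left_on="key_left", right_on="key_right",
    suffixes=("_left", "_right"),  # replaces the default ("_x", "_y")
)
print(labelled)
```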


### 6.4 Combining datasets with concat function
The `pd.concat()` function combines DataFrames that share the same set of variables (columns) but represent different observations (rows).

In [19]:
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas"],
                    'Fee' : [20000,25000,22000,24000]})

df1 = pd.DataFrame({'Courses': ["Pandas","Hadoop","Hyperion","Java"],
                    'Fee': [25000,25200,24500,24900]})
print("First DataFrame:\n", df)
print("Second DataFrame:\n", df1)

First DataFrame:

    Courses    Fee

0    Spark  20000

1  PySpark  25000

2   Python  22000

3   pandas  24000

Second DataFrame:

     Courses    Fee

0    Pandas  25000

1    Hadoop  25200

2  Hyperion  24500

3      Java  24900


In [20]:
data = [df, df1]
df2 = pd.concat(data)
print("After concatenating the two DataFrames:\n", df2)

After concatenating the two DataFrames:

     Courses    Fee

0     Spark  20000

1   PySpark  25000

2    Python  22000

3    pandas  24000

0    Pandas  25000

1    Hadoop  25200

2  Hyperion  24500

3      Java  24900


In [22]:
df3 = pd.concat(data, ignore_index=True)
print(df3)

    Courses    Fee

0     Spark  20000

1   PySpark  25000

2    Python  22000

3    pandas  24000

4    Pandas  25000

5    Hadoop  25200

6  Hyperion  24500

7      Java  24900
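
The examples above assume that both DataFrames have identical columns. As a hedged sketch of what happens otherwise, the made-up tables `a` and `b` below differ by one column (`Duration` is an assumption added for illustration). With the default `join="outer"`, `pd.concat()` keeps the union of the columns and fills the gaps with `NaN`; with `join="inner"`, it keeps only the columns common to all inputs.

```python
import pandas as pd

# Hypothetical toy data: the second table has an extra column
a = pd.DataFrame({"Courses": ["Spark", "Python"], "Fee": [20000, 22000]})
b = pd.DataFrame({"Courses": ["Java"], "Fee": [24900], "Duration": ["30 days"]})

# Default join="outer": keep the union of columns, fill gaps with NaN
stacked_outer = pd.concat([a, b], ignore_index=True)

# join="inner": keep only the columns shared by every input
stacked_inner = pd.concat([a, b], ignore_index=True, join="inner")

print(stacked_outer)
print(stacked_inner)
```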


### 6.5 Practice Questions
Refer to the `nba_all_elo.csv` file downloaded in Tutorial 4, available at this URL https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv.
Import it into a Pandas DataFrame named `nba`.
1. For the `notes` column, replace `"NULL"` with `np.nan`.
2. By using the `game_location` column (`"H": Home, "A": Away`) and the `game_result` column (`"W": Win, "L": Lose`), evaluate the probability of winning given that it is a home match.
3. For each team (identified by `team_id`), evaluate the probability of winning given that it is an away match, using the `.groupby()` method.

In [2]:
import numpy as np
import pandas as pd
nba = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv")

In [3]:
nba.notes = nba.notes.replace("NULL",np.nan)

In [4]:
prob_H_W = nba.game_id[(nba.game_location=="H")&(nba.game_result=="W")].count() / \
nba.game_id[nba.game_location=="H"].count()

In [5]:
print(prob_H_W)

0.6225252621242358


In [6]:
def prob_A_W(x):
    """Return the proportion of away games won, computed from the rows of a single team."""
    count_away_win = x.game_id[(x.game_location=="A")&(x.game_result=="W")].count()
    count_away = x.game_id[x.game_location=="A"].count()
    return count_away_win / count_away

prob_by_team = nba.groupby("team_id").apply(prob_A_W)
print(prob_by_team)

team_id
ANA    0.216216
AND    0.382353
ATL    0.361277
BAL    0.368421
BLB    0.222222
         ...   
WAS    0.310298
WAT    0.096774
WSA    0.403509
WSB    0.334002
WSC    0.409722
Length: 104, dtype: float64
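
As an alternative sketch of the same conditional probabilities, the mean of a True/False "win" flag equals the proportion of wins, so filtering first and then averaging gives the answer without a custom function. The `games` DataFrame below is made-up toy data that only borrows the column names of the NBA file; it is not the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the NBA file
games = pd.DataFrame({
    "team_id": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "game_location": ["A", "A", "H", "A", "H", "A"],
    "game_result": ["W", "L", "W", "L", "W", "L"],
})

# P(win | home): mean of a True/False win flag over home games
home = games[games["game_location"] == "H"]
p_home_win = (home["game_result"] == "W").mean()

# P(win | away) per team: same idea, grouped by team_id
away = games[games["game_location"] == "A"]
p_away_win_by_team = (away["game_result"] == "W").groupby(away["team_id"]).mean()

print(p_home_win)
print(p_away_win_by_team)
```

One design difference to keep in mind: because the away games are filtered before grouping, a team with no away games would simply be absent from the result rather than appearing with a missing value.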
