## Data Manipulation

1. **`df.drop()`:** Remove specified rows or columns from the DataFrame.

2. **`df.rename()`:** Rename columns or index labels.

3. **`df.sort_values()`:** Sort the DataFrame by specified columns.

4. **`df.groupby()`:** Group data based on one or more columns.

5. **`df.groupby(...).get_group()`:** Get a specific group from a grouped DataFrame using the group key. This is a method of the GroupBy object, not of the DataFrame itself.

6. **`df['a'].astype('data_type')`:** Convert the data type of a specific column 'a' to the specified data type.

7. **`df.set_index()`:** Set the DataFrame's index to a specific column.

8. **`df.reset_index()`:** Reset the DataFrame's index, optionally dropping the current index.

9. **`df.agg()`:** Perform aggregation operations on the DataFrame, such as mean or sum.

10. **`df['a'].map()`:** Map each value of the Series 'a' through a function or dictionary.

11. **`df.rank()`:** Compute numerical rank of elements in the DataFrame.

In [1]:
import pandas as pd
import numpy as np

In [46]:
#read the data
df = pd.read_csv("data/nba.csv") # Load NBA dataset into a DataFrame

df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


**`df.drop()`**

In [3]:
df_new = df.drop(columns=["Height"]) # Drop the 'Height' column; 'columns=' already names the axis, so 'axis=1' is not needed
df_new.head()

Unnamed: 0,Name,Team,Number,Position,Age,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,231.0,,5000000.0
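
Beyond a single column, `drop()` also accepts several labels at once and can remove rows by index label. The sketch below is a minimal illustration on a small hand-built frame (`sample` is a name introduced here, filled with three rows from the table above), not the full dataset.

```python
import pandas as pd

# Small illustrative frame built from three rows shown above
sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "Team": ["Boston Celtics"] * 3,
    "Height": ["6-2", "6-6", "6-5"],
    "Weight": [180.0, 235.0, 205.0],
})

# Drop several columns at once
no_size = sample.drop(columns=["Height", "Weight"])

# Drop rows by index label
no_first = sample.drop(index=[0])

# errors="ignore" skips labels that do not exist instead of raising KeyError
unchanged = sample.drop(columns=["Salary"], errors="ignore")

print(list(no_size.columns))
print(list(no_first.index))
```

In every call `drop()` returns a new DataFrame, so `sample` itself keeps all four columns.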


**`df.rename()`**

In [4]:
#rename College to University and Salary to Wages -> column renaming
df_renamed = df.rename(columns= {"College":"University", "Salary":"Wages"})

df_renamed.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,University,Wages
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [5]:
#index renaming -> 0 and 1 to x and y
df_renamed = df.rename(index= {0:"x", 1:"y"})

df_renamed.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
x,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
y,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [6]:
#index -> pass a function to convert every label; here each label becomes a string (the output looks the same, but 0 is now "0")
df_renamed = df.rename(index = str)

df_renamed.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
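
The three forms of `rename()` used above (dictionary, index mapping, function) can be checked on a tiny frame. This is a sketch under the assumption of a hand-built `sample` frame with values taken from the rows above; it also shows that keys missing from the frame are silently ignored by default.

```python
import pandas as pd

# Small illustrative frame; values copied from the rows shown above
sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "College": ["Texas", "Marquette", "Boston University"],
})

# A function is applied to every column label
lowered = sample.rename(columns=str.lower)

# Keys that are not present ("Salary" here) are ignored by default
partial = sample.rename(columns={"College": "University", "Salary": "Wages"})

# index=str converts each index label to a string
as_str = sample.rename(index=str)

print(list(lowered.columns))
print(list(partial.columns))
print(type(as_str.index[0]).__name__)
```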


**`sort_values()`**

In [7]:
#sort by Salary in descending order, breaking ties by Name in ascending order
df_sorted = df.sort_values(by=["Salary", "Name"], ascending=[False, True])

df_sorted.head(8)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000.0
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500.0
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000.0
251,Dwight Howard,Houston Rockets,12.0,C,30.0,6-11,265.0,,22359364.0
339,Chris Bosh,Miami Heat,1.0,PF,32.0,6-11,235.0,Georgia Tech,22192730.0
100,Chris Paul,Los Angeles Clippers,3.0,PG,31.0,6-0,175.0,Wake Forest,21468695.0
414,Kevin Durant,Oklahoma City Thunder,35.0,SF,27.0,6-9,240.0,Texas,20158622.0
164,Derrick Rose,Chicago Bulls,1.0,PG,27.0,6-3,190.0,Memphis,20093064.0
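
Because the Salary column contains missing values, it helps to know where `sort_values()` puts them. The sketch below uses a small hand-built frame (`sample`, an illustrative name, with salaries copied from the first rows of the dataset) to show the default and the `na_position` option.

```python
import pandas as pd

# Illustrative frame; John Holland has no salary in the data shown above
sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland", "R.J. Hunter"],
    "Salary": [7730337.0, 6796117.0, float("nan"), 1148640.0],
})

# NaN rows go last by default, even when sorting in descending order
desc = sample.sort_values(by="Salary", ascending=False)

# na_position="first" moves them to the top instead
nan_first = sample.sort_values(by="Salary", ascending=False, na_position="first")

# ignore_index=True replaces the original labels with 0..n-1
renumbered = sample.sort_values(by="Salary", ascending=False, ignore_index=True)

print(list(desc["Name"]))
print(list(nan_first["Name"]))
```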


**`groupby()`**

In [8]:
#find mean of salary for each Team
df_grpd = df.groupby(by=["Team"], as_index=False)["Salary"].mean()

df_grpd.head(4)

Unnamed: 0,Team,Salary
0,Atlanta Hawks,4860197.0
1,Boston Celtics,4181505.0
2,Brooklyn Nets,3501898.0
3,Charlotte Hornets,5222728.0


In [9]:
#find the mean, min, max and range of Age for each team
df_grpd = df.groupby(by=["Team"], as_index=False).agg({"Age" : ["mean", "min", "max", lambda x : x.max()- x.min()]})
df_grpd.rename(columns={"<lambda_0>":"Range"}, inplace=True) # Rename the lambda column to 'Range'
df_grpd.head()

,Team,Age,Age,Age,Age
,,mean,min,max,Range
0,Atlanta Hawks,28.2,22.0,35.0,13.0
1,Boston Celtics,24.733333,20.0,29.0,9.0
2,Brooklyn Nets,25.6,21.0,32.0,11.0
3,Charlotte Hornets,26.133333,21.0,31.0,10.0
4,Chicago Bulls,27.4,21.0,35.0,14.0
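
An alternative to renaming the `<lambda_0>` column afterwards is named aggregation, where each output column gets its name up front and the result has flat (single-level) column labels. The sketch below assumes a made-up frame with hypothetical teams "A" and "B"; the ages are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up data for illustration only
sample = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Age": [22.0, 25.0, 27.0, 30.0, 34.0],
})

# Named aggregation: output_name=(column, function)
summary = sample.groupby("Team", as_index=False).agg(
    mean_age=("Age", "mean"),
    min_age=("Age", "min"),
    max_age=("Age", "max"),
    age_range=("Age", lambda s: s.max() - s.min()),
)

print(summary)
```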


**`get_group`**

In [10]:
df_grpd = df.groupby("Team") # Build a GroupBy object keyed on 'Team'
df_grpd.get_group('Atlanta Hawks')['Age'].mean() # Get the mean age for the 'Atlanta Hawks' team

np.float64(28.2)
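
`get_group()` returns the rows of one group as an ordinary DataFrame, which is the same result a boolean filter on the grouping column would give. A minimal sketch, assuming a made-up frame with hypothetical teams "A" and "B":

```python
import pandas as pd

# Made-up data for illustration only
sample = pd.DataFrame({
    "Team": ["A", "B", "A", "B"],
    "Age": [22.0, 30.0, 26.0, 34.0],
})

grouped = sample.groupby("Team")

# Rows belonging to team "A", with their original index labels
team_a = grouped.get_group("A")

# The equivalent boolean filter
filtered = sample[sample["Team"] == "A"]

# The available group keys
keys = sorted(grouped.groups)

print(keys)
print(team_a["Age"].mean())
```

When the group keys are already known, the boolean filter is usually enough; `get_group()` is convenient when a GroupBy object is being reused for several lookups.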

**`df['a'].astype('data_type')`**

In [11]:
#convert Salary to an integer type (astype truncates, it does not round)
df["Salary"] = df["Salary"].fillna(0).astype('int64') # NaN cannot be stored in int64, so missing salaries are replaced with 0 first

df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000
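
Filling missing salaries with 0 makes them indistinguishable from a real zero. One possible alternative, sketched here on a small Series with values copied from the rows above, is the nullable `Int64` dtype (capital "I"), which keeps missing entries as `<NA>`. The last lines illustrate that `astype('int64')` truncates rather than rounds.

```python
import pandas as pd

# Salaries from the first rows of the dataset; one is missing
salary = pd.Series([7730337.0, 6796117.0, float("nan"), 1148640.0], name="Salary")

# Approach used above: replace NaN with 0, then convert
filled = salary.fillna(0).astype("int64")

# Nullable integer dtype: missing values stay missing
nullable = salary.astype("Int64")

# astype truncates toward zero; call round() first to get the nearest integer
truncated = pd.Series([1.9]).astype("int64")
rounded = pd.Series([1.9]).round().astype("int64")

print(filled.dtype, nullable.dtype)
print(nullable.isna().sum())
```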


**`df.set_index()`**

In [12]:
#set Team as index

df_index = df.set_index(keys=["Team"])

df_index.head(4)

Team,Name,Number,Position,Age,Height,Weight,College,Salary
Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337
Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117
Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,0
Boston Celtics,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640


**`df.reset_index()`**

In [13]:
df_non_index = df_index.reset_index(level=None) # Reset index to default integer index
df_non_index.head(3)

Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,0
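
`set_index()` and `reset_index()` are inverses of each other unless `drop=True` is passed, in which case the old index is discarded instead of being restored as a column. A minimal sketch on a hand-built frame (`sample` is an illustrative name; the values come from the rows above):

```python
import pandas as pd

# Small illustrative frame built from rows shown above
sample = pd.DataFrame({
    "Team": ["Boston Celtics"] * 3,
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "Number": [0.0, 99.0, 30.0],
})

indexed = sample.set_index("Team")

# Default: the index comes back as the first column
restored = indexed.reset_index()

# drop=True: the index is thrown away
dropped = indexed.reset_index(drop=True)

# A list of keys builds a MultiIndex
multi = sample.set_index(["Team", "Name"])

print(list(restored.columns))
print(list(dropped.columns))
print(multi.index.nlevels)
```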


**`df.agg()`**

In [14]:
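#find avg age of the players (missing ages are counted as 0 here, which lowers the mean slightly)

df["mean_age"] = df["Age"].fillna(0).agg("mean") # The scalar mean is broadcast to every row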

df.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,mean_age
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337,26.879913
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117,26.879913
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0,26.879913
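
The effect of `fillna(0)` on the mean is easy to see on a small Series. The sketch below uses made-up ages (one missing) and also shows that `agg()` accepts a list of function names; by default the mean simply skips NaN.

```python
import pandas as pd

# Made-up ages for illustration; one value is missing
ages = pd.Series([25.0, 25.0, float("nan"), 22.0], name="Age")

# Default behaviour: NaN is skipped, so the mean is 72 / 3
skip_nan = ages.agg("mean")

# With fillna(0) the missing entry counts as 0, so the mean is 72 / 4
zero_filled = ages.fillna(0).agg("mean")

# Several aggregations at once return a Series indexed by function name
stats = ages.agg(["mean", "min", "max"])

print(skip_nan, zero_filled)
print(stats)
```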


**`df['a'].map()`**

In [15]:
df["Position"] = df["Position"].map({"PG":"pg"}) # Map 'PG' to 'pg'; values that are not keys of the dictionary become NaN

df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,mean_age
0,Avery Bradley,Boston Celtics,0.0,pg,25.0,6-2,180.0,Texas,7730337,26.879913
1,Jae Crowder,Boston Celtics,99.0,,25.0,6-6,235.0,Marquette,6796117,26.879913
2,John Holland,Boston Celtics,30.0,,27.0,6-5,205.0,Boston University,0,26.879913
3,R.J. Hunter,Boston Celtics,28.0,,22.0,6-5,185.0,Georgia State,1148640,26.879913
4,Jonas Jerebko,Boston Celtics,8.0,,29.0,6-10,231.0,,5000000,26.879913
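
As the output shows, every position other than PG became NaN because it had no entry in the dictionary. If the goal is to change some values and keep the rest, or to lowercase everything, the following sketch shows two possible alternatives on a small Series with the positions from the first five rows.

```python
import pandas as pd

# Positions of the first five players shown above
pos = pd.Series(["PG", "SF", "SG", "SG", "PF"], name="Position")

# map() with a dictionary: unmapped values become NaN
mapped = pos.map({"PG": "pg"})

# replace() changes only the listed values and keeps the others
replaced = pos.replace({"PG": "pg"})

# The .str accessor lowercases every value
lowered = pos.str.lower()

print(mapped.isna().sum())
print(list(replaced))
print(list(lowered))
```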


**`df.rank()`**

In [None]:
#add a rank column for each team by Salary
df = pd.read_csv("data/nba.csv") # Reload so Salary still holds NaN and the edits from the earlier cells are undone
df = df.dropna(subset=["Salary"]) # Drop rows with NaN in 'Salary' so that every remaining row gets a rank
# Rank players within each team by 'Salary' in descending order
df["rank"] = df.groupby("Team")["Salary"].rank(ascending=False).astype('int64') # rank() returns floats (ties share an average rank); astype truncates them to integers

df.head(10) # The highest-paid player on each team has rank 1

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,rank
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,2
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,4
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,14
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,5
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0,1
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0,13
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0,10
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0,11
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0,6
10,Jared Sullinger,Boston Celtics,7.0,C,24.0,6-9,260.0,Ohio State,2569260.0,9
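
When two players on a team earn the same salary, the default `method="average"` gives them a fractional rank, which the integer conversion above then truncates. The sketch below, on made-up salaries for hypothetical teams "A" and "B", compares the default with `method="min"` and `method="dense"`.

```python
import pandas as pd

# Made-up data for illustration; the first two players of team A are tied
sample = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Salary": [5_000_000, 5_000_000, 1_000_000, 2_000_000, 3_000_000],
})

by_team = sample.groupby("Team")["Salary"]

# Default: tied rows share the average of the ranks they span (1 and 2 -> 1.5)
average = by_team.rank(ascending=False)

# method="min": tied rows get the lowest rank of the tie, the next rank is skipped
minimum = by_team.rank(ascending=False, method="min").astype("int64")

# method="dense": like "min", but ranks increase by 1 with no gaps
dense = by_team.rank(ascending=False, method="dense").astype("int64")

print(list(average))
print(list(minimum))
print(list(dense))
```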


Next Chapter [Missing Data Handling](5.MissingDataHandling.ipynb)