# Data Cleaning and Transformation

1. **`df.apply()`:**
   - Purpose: This method applies a function along either the rows or columns of the DataFrame.
   - Usage: `df.apply(func, axis=0, ...)` or `df.apply(func, axis=1, ...)`
   - Parameters:
     - `func`: The function to be applied to each row or column.
     - `axis`: The axis along which the function is applied. With `axis=0` (or `"index"`, the default), `func` receives each column; with `axis=1` (or `"columns"`), it receives each row.

2. **`df.replace()`:**
   - Purpose: This method is used to replace specific values in the DataFrame with other values.
   - Usage: `df.replace(to_replace, value, inplace=False, ...)`
   - Parameters:
     - `to_replace`: The value or list of values to be replaced.
     - `value`: The value or list of values to replace `to_replace` with.
     - `inplace`: If True, the DataFrame is modified in place (i.e., no new DataFrame is created). If False (default), a new DataFrame with replaced values is returned.

3. **`df.duplicated()`:**
   - Purpose: This method identifies duplicate rows in the DataFrame and returns a boolean Series indicating whether each row is a duplicate.
   - Usage: `df.duplicated(subset=None, keep='first')`
   - Parameters:
     - `subset`: A list of column names to consider for identifying duplicates. If not provided, all columns are used.
     - `keep`: Determines which duplicates, if any, to mark as `True`. It can take the following values:
       - 'first' (default): Mark all duplicates as `True` except the first occurrence.
       - 'last': Mark all duplicates as `True` except the last occurrence.
       - False: Mark all duplicates as `True`.

4. **`df.drop_duplicates()`:**
   - Purpose: This method drops duplicate rows from the DataFrame and returns a new DataFrame with the duplicates removed.
   - Usage: `df.drop_duplicates(subset=None, keep='first', inplace=False)`
   - Parameters:
     - `subset`: A list of column names to consider for identifying duplicates. If not provided, all columns are used.
     - `keep`: Determines which duplicates, if any, to keep. It can take the same values as in `df.duplicated()`.
     - `inplace`: If True, the DataFrame is modified in place (i.e., no new DataFrame is created). If False (default), a new DataFrame with duplicates removed is returned.
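
To make the `axis` parameter of `df.apply()` concrete, here is a minimal sketch. It assumes a small made-up DataFrame called `toy` (not part of `nba.csv`) and shows what `func` receives in each case.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_ranges = toy.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_ranges)  # one value per column
print(row_sums)    # one value per row
```

The result of `axis=0` is indexed by column name, while the result of `axis=1` is indexed like the original rows.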


In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv("data/nba.csv")
df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


**`df.apply`**

In [3]:
df.apply(lambda x: x[["Team"]].str.lower(), axis="columns") # With axis="columns", x is one row at a time; the lambda lowercases that row's 'Team' value and returns only that entry

Unnamed: 0,Team
0,boston celtics
1,boston celtics
2,boston celtics
3,boston celtics
4,boston celtics
...,...
453,utah jazz
454,utah jazz
455,utah jazz
456,utah jazz


In [None]:
df.head() # The 'Team' column keeps its original capitalization because the result of apply() was not assigned back to the DataFrame

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
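
The point that a transformation returns a new object, and only persists once it is assigned back, can be sketched as follows. The `toy` DataFrame is a hypothetical stand-in for two rows of the data, and the vectorized `Series.str.lower()` is used here as one possible alternative to a row-wise `apply`.

```python
import pandas as pd

# Hypothetical stand-in for a couple of rows of the NBA data
toy = pd.DataFrame({"Name": ["A", "B"],
                    "Team": ["Boston Celtics", "Utah Jazz"]})

# The result is computed but not stored back, so toy is unchanged
lowered = toy["Team"].str.lower()
before = toy["Team"].tolist()

# Assigning the result back to the column makes the change persistent
toy["Team"] = toy["Team"].str.lower()
after = toy["Team"].tolist()

print(before)
print(after)
```

A vectorized string method is usually a simpler fit than `apply` for a single text column, and it leaves missing values as `NaN`.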


**`df.replace()`**

In [None]:
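df["Team"] = df['Team'].str.replace("Boston", "BOSTON") # Replace the substring 'Boston' with 'BOSTON' in the 'Team' column. Note: this is Series.str.replace, which works on substrings; df.replace() matches whole cell values by default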

df.head() # Now the changes are reflected in the DataFrame because we saved it back to the 'Team' column

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,BOSTON Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,BOSTON Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,BOSTON Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,BOSTON Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,BOSTON Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
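
The cell above calls `Series.str.replace`, which substitutes substrings. `DataFrame.replace`, the method described at the top of this chapter, behaves differently: by default it only replaces cells whose entire value equals `to_replace`. A minimal sketch of the difference, assuming a made-up two-row DataFrame `toy`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Team": ["Boston Celtics", "Utah Jazz"],
                    "Position": ["PG", "SF"]})

# No cell is exactly "Boston", so nothing changes
no_match = toy.replace("Boston", "BOSTON")

# A whole-value match is replaced
whole = toy.replace("Boston Celtics", "BOSTON Celtics")

# A nested dict restricts the replacement to one column
mapped = toy.replace({"Position": {"PG": "Point Guard", "SF": "Small Forward"}})

# regex=True treats to_replace as a pattern, so substrings are replaced
substring = toy.replace("Boston", "BOSTON", regex=True)

print(no_match, whole, mapped, substring, sep="\n\n")
```

All four calls return new DataFrames; `toy` itself is left untouched because `inplace` defaults to `False`.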


**`df.duplicated()`**

In [None]:
df['duplicate'] = df.duplicated(subset=["Team"], keep='first') # Identify duplicate values in the 'Team' column, marking duplicates as True except for the first occurrence
df.head(3) # The 'duplicate' column indicates whether the 'Team' value is a duplicate (True) or not (False)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,duplicate
0,Avery Bradley,BOSTON Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,False
1,Jae Crowder,BOSTON Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,True
2,John Holland,BOSTON Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,True
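
To compare the three `keep` options of `df.duplicated()` side by side, here is a small sketch on a made-up `toy` DataFrame (the team names are shortened and purely illustrative):

```python
import pandas as pd

# Hypothetical toy data: "Celtics" appears three times, "Nets" once
toy = pd.DataFrame({"Team": ["Celtics", "Celtics", "Nets", "Celtics"]})

# keep="first": every repeat after the first occurrence is flagged
first = toy.duplicated(subset=["Team"], keep="first").tolist()
# [False, True, False, True]

# keep="last": every repeat before the last occurrence is flagged
last = toy.duplicated(subset=["Team"], keep="last").tolist()
# [True, True, False, False]

# keep=False: every row that has any duplicate is flagged
every = toy.duplicated(subset=["Team"], keep=False).tolist()
# [True, True, False, True]

print(first, last, every, sep="\n")
```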


**`df.drop_duplicates()`**

In [9]:
# Drop rows that are duplicated across all columns
print("Original DataFrame shape:", df.shape)
df_distinct = df.drop_duplicates()
print("DataFrame shape after removing duplicates:", df_distinct.shape) # The shape is unchanged because no row is duplicated across all columns

Original DataFrame shape: (458, 10)
DataFrame shape after removing duplicates: (458, 10)


In [11]:
df_distinct = df.drop_duplicates(subset=["Team"], keep='first') # Drop duplicate rows based on the 'Team' column, keeping the first occurrence

print("DataFrame shape after removing duplicates:", df_distinct.shape) # Now the shape will be different as duplicates in the 'Team' column are removed
df_distinct.head(3)

DataFrame shape after removing duplicates: (31, 10)


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,duplicate
0,Avery Bradley,BOSTON Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,False
15,Bojan Bogdanovic,Brooklyn Nets,44.0,SG,27.0,6-8,216.0,,3425510.0,False
30,Arron Afflalo,New York Knicks,4.0,SG,30.0,6-5,210.0,UCLA,8000000.0,False
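
The other `keep` values of `df.drop_duplicates()`, and its relationship to `df.duplicated()`, can be sketched on a made-up `toy` DataFrame (names and teams are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "Team": ["Celtics", "Celtics", "Nets", "Celtics"]})

# keep="last": keep the last row of each team
keep_last = toy.drop_duplicates(subset=["Team"], keep="last")

# keep=False: drop every row whose team appears more than once
drop_all = toy.drop_duplicates(subset=["Team"], keep=False)

# Filtering with the inverted duplicated() mask gives the same rows
# as drop_duplicates() with the default keep="first"
via_mask = toy[~toy.duplicated(subset=["Team"])]
same = via_mask.equals(toy.drop_duplicates(subset=["Team"]))

# The original index labels survive; reset them if a clean 0..n-1 index is wanted
reindexed = keep_last.reset_index(drop=True)

print(keep_last, drop_all, reindexed, sep="\n\n")
```

Note that the surviving rows keep their original index labels, which is why the output above shows indices 0, 15, and 30.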


Next chapter: [Merging and Joining DataFrames](7.MergingJoiningDataFrames.ipynb)