# Adding and Removing Data

## About the Data
In this notebook, we will be working with FIFA player data for 2022 obtained from [Kaggle](https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset).

## Setup
We will be working with the `players_22.csv` file, so we need to handle our imports and read it in.

In [1]:
import pandas as pd

In [4]:
players = pd.read_csv(
    'players_22.csv', 
    usecols=['short_name', 'wage_eur', 'age', 'club_name', 'nationality_name', 'preferred_foot']
)

## Creating new data
### Adding new columns
New columns get added to the right of the original columns. The new column can be a single value, which will be **broadcast** along the rows of the dataframe:

In [5]:
players['data_source'] = 'SO FIFA WEBSITE'
players.head()

Unnamed: 0,short_name,wage_eur,age,club_name,nationality_name,preferred_foot,data_source
0,L. Messi,320000.0,34,Paris Saint-Germain,Argentina,Left,SO FIFA WEBSITE
1,R. Lewandowski,270000.0,32,FC Bayern München,Poland,Right,SO FIFA WEBSITE
2,Cristiano Ronaldo,270000.0,36,Manchester United,Portugal,Right,SO FIFA WEBSITE
3,Neymar Jr,270000.0,29,Paris Saint-Germain,Brazil,Right,SO FIFA WEBSITE
4,K. De Bruyne,350000.0,30,Manchester City,Belgium,Right,SO FIFA WEBSITE


...or a Boolean mask:

In [6]:
players['player_younger_than_18'] = players.age < 18
players.head()

Unnamed: 0,short_name,wage_eur,age,club_name,nationality_name,preferred_foot,data_source,player_younger_than_18
0,L. Messi,320000.0,34,Paris Saint-Germain,Argentina,Left,SO FIFA WEBSITE,False
1,R. Lewandowski,270000.0,32,FC Bayern München,Poland,Right,SO FIFA WEBSITE,False
2,Cristiano Ronaldo,270000.0,36,Manchester United,Portugal,Right,SO FIFA WEBSITE,False
3,Neymar Jr,270000.0,29,Paris Saint-Germain,Brazil,Right,SO FIFA WEBSITE,False
4,K. De Bruyne,350000.0,30,Manchester City,Belgium,Right,SO FIFA WEBSITE,False
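
Both cases can be reproduced without the CSV file. The following is a minimal sketch on a small made-up dataframe; `toy` and its values are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical stand-in for the players dataframe (values are made up)
toy = pd.DataFrame({
    'short_name': ['A. Example', 'B. Sample', 'C. Demo'],
    'age': [17, 25, 34],
})

# A single value is broadcast to every row
toy['data_source'] = 'SO FIFA WEBSITE'

# A comparison yields a Boolean series aligned on the index
toy['player_younger_than_18'] = toy.age < 18

print(toy)
```

Each new column lands to the right of the existing ones, in the order it was created.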


#### Using the `assign()` method to create columns
To create many columns at once or update existing columns, we can use `assign()`. Note that it returns a new dataframe rather than modifying the original:

In [8]:
players.assign(
    from_france = players.nationality_name.str.contains('France'),
    from_argentina = players.nationality_name.str.contains('Argentina')
).head()

Unnamed: 0,short_name,wage_eur,age,club_name,nationality_name,preferred_foot,data_source,player_younger_than_18,from_france,from_argentina
0,L. Messi,320000.0,34,Paris Saint-Germain,Argentina,Left,SO FIFA WEBSITE,False,False,True
1,R. Lewandowski,270000.0,32,FC Bayern München,Poland,Right,SO FIFA WEBSITE,False,False,False
2,Cristiano Ronaldo,270000.0,36,Manchester United,Portugal,Right,SO FIFA WEBSITE,False,False,False
3,Neymar Jr,270000.0,29,Paris Saint-Germain,Brazil,Right,SO FIFA WEBSITE,False,False,False
4,K. De Bruyne,350000.0,30,Manchester City,Belgium,Right,SO FIFA WEBSITE,False,False,False
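
When one new column depends on another, `assign()` also accepts callables. The sketch below assumes a made-up dataframe `toy`; the column names `wage_k_eur` and `high_earner` and the threshold of 50 are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    'short_name': ['A. Example', 'B. Sample'],
    'wage_eur': [1000.0, 52000.0],
})

# Each lambda receives the dataframe as built so far, so a later keyword
# argument can use a column created earlier in the same call
result = toy.assign(
    wage_k_eur=lambda df: df.wage_eur / 1000,
    high_earner=lambda df: df.wage_k_eur > 50,
)

print(result)
print(toy.columns.tolist())  # the original dataframe is unchanged
```

To keep the new columns, the result has to be assigned to a variable, since `toy` itself is left as it was.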


#### Concatenation
Say we were working with two separate dataframes, one with right-footed players and the other with left-footed players. If we wanted to look at all the players, we would have to concatenate the dataframes into a single one:

In [9]:
right_foot = players[players.preferred_foot == 'Right']
left_foot = players[players.preferred_foot == 'Left']

right_foot.shape, left_foot.shape

((14674, 8), (4565, 8))

The `pd.concat()` function works along the row axis (`axis=0`) by default, which is equivalent to appending to the bottom. By concatenating the right-foot and left-foot dataframes, we get the full players dataset back:

In [10]:
pd.concat([right_foot, left_foot]).shape

(19239, 8)

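Note that, in pandas versions before 2.0, the previous result is equivalent to running the `append()` method of the dataframe. This method was deprecated in pandas 1.4 and removed in 2.0, so `pd.concat()` is the one to use in new code: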

In [11]:
right_foot.append(left_foot).shape

(19239, 8)
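
One detail worth checking after concatenating along the rows is the index: each piece keeps its original labels unless we ask for new ones. A minimal sketch on a made-up dataframe (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    'short_name': ['A. Example', 'B. Sample', 'C. Demo'],
    'preferred_foot': ['Right', 'Left', 'Right'],
})
right_foot = toy[toy.preferred_foot == 'Right']
left_foot = toy[toy.preferred_foot == 'Left']

# By default, the original index labels are kept
combined = pd.concat([right_foot, left_foot])
print(combined.index.tolist())

# ignore_index=True renumbers the rows starting from 0
renumbered = pd.concat([right_foot, left_foot], ignore_index=True)
print(renumbered.index.tolist())
```

Keeping the original labels is useful when the rows need to be traced back to the source data; renumbering is handy when the old labels no longer mean anything.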

We have been working with a subset of the columns from the CSV file, but suppose that we now want some of the columns we ignored when we read in the data. Since we have added new columns in this notebook, we don't want to read in the file and perform those operations again. Instead, we will concatenate along the columns (`axis=1`) to add back what we are missing. This works because both dataframes are read from the same file and therefore share the same index:

In [12]:
additional_columns = pd.read_csv(
    'players_22.csv', usecols=['player_positions', 'potential', 'overall']
)

In [13]:
pd.concat([players, additional_columns], axis=1).head()

Unnamed: 0,short_name,wage_eur,age,club_name,nationality_name,preferred_foot,data_source,player_younger_than_18,player_positions,overall,potential
0,L. Messi,320000.0,34,Paris Saint-Germain,Argentina,Left,SO FIFA WEBSITE,False,"RW, ST, CF",93,93
1,R. Lewandowski,270000.0,32,FC Bayern München,Poland,Right,SO FIFA WEBSITE,False,ST,92,92
2,Cristiano Ronaldo,270000.0,36,Manchester United,Portugal,Right,SO FIFA WEBSITE,False,"ST, LW",91,91
3,Neymar Jr,270000.0,29,Paris Saint-Germain,Brazil,Right,SO FIFA WEBSITE,False,"LW, CAM",91,91
4,K. De Bruyne,350000.0,30,Manchester City,Belgium,Right,SO FIFA WEBSITE,False,"CM, CAM",91,91
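
Concatenating along the columns matches rows by index label rather than by position. The sketch below uses two made-up dataframes, `left` and `right` (hypothetical names and values), whose indices only partly overlap:

```python
import pandas as pd

# Hypothetical data with partially overlapping index labels
left = pd.DataFrame({'short_name': ['A. Example', 'B. Sample']}, index=[0, 1])
right = pd.DataFrame({'overall': [80, 75]}, index=[1, 2])

# Rows are matched by index label; labels missing on one side get NaN
aligned = pd.concat([left, right], axis=1)
print(aligned)
```

If the indices of the two dataframes had been changed independently (for example, by sorting and resetting one of them), the values would silently line up with the wrong rows, so it is worth confirming the indices agree before concatenating.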


## Deleting Unwanted Data
Columns can be deleted using dictionary syntax with `del`:

In [14]:
del players['data_source']
players.columns

Index(['short_name', 'wage_eur', 'age', 'club_name', 'nationality_name',
       'preferred_foot', 'player_younger_than_18'],
      dtype='object')

We can also use `pop()`. This will allow us to use the series we remove later.

In [15]:
player_younger_than_18 = players.pop('player_younger_than_18')
players.columns

Index(['short_name', 'wage_eur', 'age', 'club_name', 'nationality_name',
       'preferred_foot'],
      dtype='object')

Notice we have a mask in `player_younger_than_18` now:

In [16]:
player_younger_than_18.value_counts()

False    18948
True       291
Name: player_younger_than_18, dtype: int64

Now, we can use `player_younger_than_18` to filter our data:

In [17]:
players[player_younger_than_18].head()

Unnamed: 0,short_name,wage_eur,age,club_name,nationality_name,preferred_foot
3061,R. Cherki,9000.0,17,Olympique Lyonnais,France,Left
4575,A. Karabec,500.0,17,AC Sparta Praha,Czech Republic,Left
5434,L. Gourna,700.0,17,AS Saint-Étienne,France,Right
5464,D. Samek,500.0,17,SK Slavia Praha,Czech Republic,Right
7335,K. Kozłowski,500.0,17,Pogoń Szczecin,Poland,Right
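
The same pattern can be tried on a small made-up dataframe. In this sketch, `toy`, `mask`, `minors`, and `adults` are hypothetical names, and the values are not taken from the dataset:

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    'short_name': ['A. Example', 'B. Sample', 'C. Demo'],
    'age': [17, 25, 34],
})
toy['player_younger_than_18'] = toy.age < 18

# pop() removes the column in place and returns it as a series
mask = toy.pop('player_younger_than_18')

minors = toy[mask]   # rows where the mask is True
adults = toy[~mask]  # ~ inverts the mask

print(minors)
print(adults)
```

Because the popped series keeps the index of the dataframe it came from, it still lines up with the remaining rows.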


### Using the `drop()` method
We can drop rows by passing a list of index labels to the `drop()` method. Notice in the following example that when asking for the first 2 rows with `head()` we get the 3rd and 4th rows because we dropped the original first 2 with `drop([0, 1])`:

In [19]:
players.drop([0, 1]).head(2)

Unnamed: 0,short_name,wage_eur,age,club_name,nationality_name,preferred_foot
2,Cristiano Ronaldo,270000.0,36,Manchester United,Portugal,Right
3,Neymar Jr,270000.0,29,Paris Saint-Germain,Brazil,Right
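
With the default index, labels and positions happen to coincide. The following sketch, on a made-up dataframe with a non-default index (`toy` and its values are hypothetical), shows that `drop()` works with labels:

```python
import pandas as pd

# Hypothetical data with index labels that differ from row positions
toy = pd.DataFrame(
    {'short_name': ['A. Example', 'B. Sample', 'C. Demo']},
    index=[10, 20, 30],
)

# drop() matches index labels: this removes the row labeled 20
by_label = toy.drop([20])

# To drop by position, look up the labels at those positions first
by_position = toy.drop(toy.index[[0, 1]])

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Passing a label that does not exist, such as `toy.drop([0])` here, raises a `KeyError` unless `errors='ignore'` is given.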


The `drop()` method drops along the row axis by default. If we pass in a list of columns with the `columns` argument, we can delete columns:

In [21]:
players.drop(columns=['nationality_name', 'preferred_foot']).head()

Unnamed: 0,short_name,wage_eur,age,club_name
0,L. Messi,320000.0,34,Paris Saint-Germain
1,R. Lewandowski,270000.0,32,FC Bayern München
2,Cristiano Ronaldo,270000.0,36,Manchester United
3,Neymar Jr,270000.0,29,Paris Saint-Germain
4,K. De Bruyne,350000.0,30,Manchester City


We also have the option of using `axis=1`:

In [23]:
players.drop(['nationality_name', 'preferred_foot'], axis=1).head()

Unnamed: 0,short_name,wage_eur,age,club_name
0,L. Messi,320000.0,34,Paris Saint-Germain
1,R. Lewandowski,270000.0,32,FC Bayern München
2,Cristiano Ronaldo,270000.0,36,Manchester United
3,Neymar Jr,270000.0,29,Paris Saint-Germain
4,K. De Bruyne,350000.0,30,Manchester City


By default, `drop()`, along with the majority of `DataFrame` methods, will return a new `DataFrame` object. If we just want to change the one we are working with, we can pass `inplace=True`. This should be used with care, since the original data is overwritten:

In [24]:
players.drop(columns=['nationality_name', 'preferred_foot'], inplace=True)
players.head()

Unnamed: 0,short_name,wage_eur,age,club_name
0,L. Messi,320000.0,34,Paris Saint-Germain
1,R. Lewandowski,270000.0,32,FC Bayern München
2,Cristiano Ronaldo,270000.0,36,Manchester United
3,Neymar Jr,270000.0,29,Paris Saint-Germain
4,K. De Bruyne,350000.0,30,Manchester City
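
The difference between the two behaviors can be seen in the following minimal sketch on a made-up dataframe (`toy`, `trimmed`, and `returned` are hypothetical names):

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    'short_name': ['A. Example', 'B. Sample'],
    'age': [17, 25],
    'preferred_foot': ['Left', 'Right'],
})

# Without inplace, a new dataframe is returned and toy is untouched
trimmed = toy.drop(columns=['preferred_foot'])

# With inplace=True, toy itself is modified and the call returns None
returned = toy.drop(columns=['age'], inplace=True)

print(trimmed.columns.tolist())
print(returned)
print(toy.columns.tolist())
```

A common mistake is writing `toy = toy.drop(..., inplace=True)`, which replaces the dataframe with `None`.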
