# Manipulating Data

You can manipulate data with a variety of pandas methods. Some of the things you can do are:
* Change text
* Perform math operations
* Sort data

In [1]:
# TODOs
# Sort Data
# Filter data

## How to SUM columns in Pandas
Adding up the values in a column is easy in pandas: just call the .sum() method on the column.

In [2]:
import pandas as pd

data = [['Puente, Jose',10, 'Watch anime'],['Madrigal, Josefina',15, "Build models"],['Gaona, Joseph',12, 'Dancing'],['Johnson, Jotaro',13, 'Play Videogames'], ['Chisaka, Joseline',16, 'Running'],['Issawi, Jorge', 14, 'Play with animals'],['Dalal, Joe',17,'Reading']]
df = pd.DataFrame(data,columns=['name','age','favorite hobby'])
df

Unnamed: 0,name,age,favorite hobby
0,"Puente, Jose",10,Watch anime
1,"Madrigal, Josefina",15,Build models
2,"Gaona, Joseph",12,Dancing
3,"Johnson, Jotaro",13,Play Videogames
4,"Chisaka, Joseline",16,Running
5,"Issawi, Jorge",14,Play with animals
6,"Dalal, Joe",17,Reading


In [3]:
jojos_total_years_sum = df['age'].sum()
print(f'The total years sum is {jojos_total_years_sum:.2f}')

The total years sum is 97.00
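The same method also works on a whole DataFrame. As a minimal sketch, assuming a small stand-in DataFrame (the `scores` frame and its column names are made up for illustration), `.sum()` adds up each column, and `axis=1` adds across the columns of each row instead:

```python
import pandas as pd

# Hypothetical stand-in data; the column names are illustrative only.
scores = pd.DataFrame({
    'math': [10, 20, 30],
    'art': [1, 2, 3],
})

# One total per column.
column_totals = scores.sum()

# One total per row (sum across the columns).
row_totals = scores.sum(axis=1)

print(column_totals)
print(row_totals)
```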


## Mean
To get the average we use the mean() method, which follows the same pattern as sum().

In [4]:
jojos_age_mean = df['age'].mean()
print(f'The average age of these JoJo\'s is: {jojos_age_mean:.2f}')

The average age of these JoJo's is: 13.86
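Because these aggregations all follow the same pattern, several of them can be requested in one call. A minimal sketch using `.agg()` on a Series holding the same ages as the example above (the `ages` and `summary` names are only for illustration):

```python
import pandas as pd

# The same ages used in the DataFrame above.
ages = pd.Series([10, 15, 12, 13, 16, 14, 17], name='age')

# Ask for several aggregations at once; the result is a Series
# indexed by the name of each aggregation.
summary = ages.agg(['sum', 'mean', 'min', 'max'])
print(summary)
```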


## Replace Text
This is how you replace text in a single column with pandas.

Notice that the str.replace method returns a new column without changing the original DataFrame, so we have to assign the returned value back to the original DataFrame column.

In [5]:
df['favorite hobby'] = df['favorite hobby'].str.replace('anime','cartoons')

In [6]:
df

Unnamed: 0,name,age,favorite hobby
0,"Puente, Jose",10,Watch cartoons
1,"Madrigal, Josefina",15,Build models
2,"Gaona, Joseph",12,Dancing
3,"Johnson, Jotaro",13,Play Videogames
4,"Chisaka, Joseline",16,Running
5,"Issawi, Jorge",14,Play with animals
6,"Dalal, Joe",17,Reading
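To make the "returns a new column" point concrete, here is a small sketch on a stand-in Series (the `hobbies` values are made up). It also passes `regex=False`, which treats the pattern as plain text, and contrasts `str.replace` (substring replacement) with `Series.replace` (whole-value replacement):

```python
import pandas as pd

# Hypothetical stand-in data.
hobbies = pd.Series(['Watch anime', 'Dancing', 'Draw anime (fan art)'])

# str.replace swaps a substring inside each value and returns a new Series.
replaced = hobbies.str.replace('anime', 'cartoons', regex=False)

# Series.replace swaps values that match exactly, not substrings.
whole_value = hobbies.replace('Dancing', 'Ballet')

print(hobbies.tolist())      # the original is unchanged
print(replaced.tolist())
print(whole_value.tolist())
```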


## Format text
We can also format text with string methods like str.upper(), str.lower(), str.capitalize() and str.title().

In [7]:
df['name'] = df['name'].str.upper()
df

Unnamed: 0,name,age,favorite hobby
0,"PUENTE, JOSE",10,Watch cartoons
1,"MADRIGAL, JOSEFINA",15,Build models
2,"GAONA, JOSEPH",12,Dancing
3,"JOHNSON, JOTARO",13,Play Videogames
4,"CHISAKA, JOSELINE",16,Running
5,"ISSAWI, JORGE",14,Play with animals
6,"DALAL, JOE",17,Reading
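The other string methods work the same way. A quick sketch on a stand-in Series (the `names` values are chosen only to show the difference between the methods):

```python
import pandas as pd

# Hypothetical stand-in data with mixed casing.
names = pd.Series(['puente, JOSE', 'dalal, joe'])

lowered = names.str.lower()            # every letter lowercase
capitalized = names.str.capitalize()   # first letter of the string uppercase, rest lowercase
titled = names.str.title()             # first letter of every word uppercase

print(lowered.tolist())
print(capitalized.tolist())
print(titled.tolist())
```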


## Add a new row or filter duplicate rows
We can add new rows to our DataFrame with the **loc** property. loc is a complex property that could have its own section; for our purposes we will just pass it the length of our DataFrame as the label, which places the new row at the end.

In [8]:
df.loc[len(df.index)] = ['PUENTE, JOSE',10,'Watch cartoons']
df

Unnamed: 0,name,age,favorite hobby
0,"PUENTE, JOSE",10,Watch cartoons
1,"MADRIGAL, JOSEFINA",15,Build models
2,"GAONA, JOSEPH",12,Dancing
3,"JOHNSON, JOTARO",13,Play Videogames
4,"CHISAKA, JOSELINE",16,Running
5,"ISSAWI, JORGE",14,Play with animals
6,"DALAL, JOE",17,Reading
7,"PUENTE, JOSE",10,Watch cartoons
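Using the length as the new label works as long as the index still runs from 0 to n-1. If rows have been removed, that label may already exist and the assignment would overwrite a row instead of adding one. As a hedged alternative sketch (the `people` frame is a made-up stand-in), `pd.concat` with `ignore_index=True` appends the row and renumbers the index:

```python
import pandas as pd

# Hypothetical stand-in data; dropping row 1 leaves the labels 0 and 2.
people = pd.DataFrame({'name': ['A', 'B', 'C'], 'age': [10, 15, 12]}).drop(index=1)

# Here people.loc[len(people.index)] would target label 2, which already
# exists, so it would overwrite row 'C' rather than append.
new_row = pd.DataFrame([{'name': 'D', 'age': 13}])

# concat always appends; ignore_index=True renumbers the rows 0..n-1.
appended = pd.concat([people, new_row], ignore_index=True)
print(appended)
```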


**duplicated()** returns a boolean Series where *True* marks a row that repeats an earlier row. In our case that is the last one (7).

In [9]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
dtype: bool
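By default every column has to match for a row to count as a duplicate. A small sketch of the `subset` and `keep` parameters, using a made-up stand-in frame in which a name repeats with a different age:

```python
import pandas as pd

# Hypothetical stand-in data: 'A' appears twice, but with different ages.
people = pd.DataFrame({'name': ['A', 'B', 'A'], 'age': [10, 15, 11]})

# All columns are compared, so no row is a full duplicate.
full_rows = people.duplicated()

# Compare only the 'name' column: the second 'A' is flagged.
by_name = people.duplicated(subset=['name'])

# keep=False flags every member of a duplicated group, including the first.
all_repeats = people.duplicated(subset=['name'], keep=False)

print(full_rows.tolist())
print(by_name.tolist())
print(all_repeats.tolist())
```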

Pandas has a built-in method to drop duplicate rows, simply called **.drop_duplicates**. The **inplace=True** argument tells pandas to change our original DataFrame instead of returning a new one.

In [10]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,name,age,favorite hobby
0,"PUENTE, JOSE",10,Watch cartoons
1,"MADRIGAL, JOSEFINA",15,Build models
2,"GAONA, JOSEPH",12,Dancing
3,"JOHNSON, JOTARO",13,Play Videogames
4,"CHISAKA, JOSELINE",16,Running
5,"ISSAWI, JORGE",14,Play with animals
6,"DALAL, JOE",17,Reading
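Without **inplace=True** the method returns a new DataFrame and leaves the original alone. A minimal sketch on a made-up stand-in frame, which also shows `ignore_index=True` for renumbering the surviving rows:

```python
import pandas as pd

# Hypothetical stand-in data: rows 0 and 1 are identical.
people = pd.DataFrame({'name': ['A', 'A', 'B'], 'age': [10, 10, 15]})

# Returns a new DataFrame; the surviving rows keep their labels (0 and 2).
unique_people = people.drop_duplicates()

# Same result, but with the index renumbered from 0.
renumbered = people.drop_duplicates(ignore_index=True)

print(unique_people)
print(renumbered)
```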


## Splitting a text column with a delimiter


In [11]:
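# Split on the first ', ' only (n=1); expand=True returns the pieces as separate columns.
# n must be passed by keyword: recent pandas versions reject it as a positional argument.
df[['Last name', 'First name']] = df['name'].str.split(', ', n=1, expand=True)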
df

Unnamed: 0,name,age,favorite hobby,Last name,First name
0,"PUENTE, JOSE",10,Watch cartoons,PUENTE,JOSE
1,"MADRIGAL, JOSEFINA",15,Build models,MADRIGAL,JOSEFINA
2,"GAONA, JOSEPH",12,Dancing,GAONA,JOSEPH
3,"JOHNSON, JOTARO",13,Play Videogames,JOHNSON,JOTARO
4,"CHISAKA, JOSELINE",16,Running,CHISAKA,JOSELINE
5,"ISSAWI, JORGE",14,Play with animals,ISSAWI,JORGE
6,"DALAL, JOE",17,Reading,DALAL,JOE
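A minimal sketch of what the split arguments do, on a small stand-in frame (the `names` frame is only for illustration): `n=1` limits the split to the first delimiter, `expand=True` returns the pieces as separate columns, and `str.strip()` removes the whitespace left behind when splitting on a bare comma.

```python
import pandas as pd

# Hypothetical stand-in data in "LAST, FIRST" form.
names = pd.DataFrame({'name': ['PUENTE, JOSE', 'DALAL, JOE']})

# Splitting on ',' alone leaves a leading space on the second piece.
parts = names['name'].str.split(',', n=1, expand=True)

# Strip the surrounding whitespace before storing the new columns.
names['Last name'] = parts[0].str.strip()
names['First name'] = parts[1].str.strip()

print(parts[1].tolist())
print(names)
```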


Now our DataFrame looks out of order, so let's tidy up. We will delete the original *name* column and put the *First name* column first, followed by *Last name*.

In [12]:
# Drop the name column (drop returns a new DataFrame, so assign the result back)
df = df.drop(columns=['name'])
# Reorder columns
df = df[['First name', 'Last name','age','favorite hobby']]
df

Unnamed: 0,First name,Last name,age,favorite hobby
0,JOSE,PUENTE,10,Watch cartoons
1,JOSEFINA,MADRIGAL,15,Build models
2,JOSEPH,GAONA,12,Dancing
3,JOTARO,JOHNSON,13,Play Videogames
4,JOSELINE,CHISAKA,16,Running
5,JORGE,ISSAWI,14,Play with animals
6,JOE,DALAL,17,Reading


## Sort data
We can sort a DataFrame by the values in a column with the **.sort_values** method. This method can take multiple arguments; for now we will only pass the column name. Notice in the output that each row keeps its original index label.

In [13]:
df = df.sort_values(by='age')
df

Unnamed: 0,First name,Last name,age,favorite hobby
0,JOSE,PUENTE,10,Watch cartoons
2,JOSEPH,GAONA,12,Dancing
3,JOTARO,JOHNSON,13,Play Videogames
5,JORGE,ISSAWI,14,Play with animals
1,JOSEFINA,MADRIGAL,15,Build models
4,JOSELINE,CHISAKA,16,Running
6,JOE,DALAL,17,Reading
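A few of the other arguments, sketched on a made-up stand-in frame: `ascending=False` reverses the order, a list passed to `by` sorts on several columns, and `reset_index(drop=True)` renumbers the rows after sorting.

```python
import pandas as pd

# Hypothetical stand-in data with a tie on age.
people = pd.DataFrame({'name': ['B', 'A', 'C'], 'age': [15, 10, 15]})

# Oldest first.
by_age_desc = people.sort_values(by='age', ascending=False)

# Oldest first, ties broken alphabetically by name, index renumbered.
by_age_then_name = (
    people.sort_values(by=['age', 'name'], ascending=[False, True])
          .reset_index(drop=True)
)

print(by_age_desc)
print(by_age_then_name)
```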


## Filter data
Data cleaning and filtering are among the most common uses for spreadsheets and DataFrames, and they are an integral part of data analysis.

Filtering lets us focus on just the data we want to analyze or visualize.

For this next example, we will import and filter a CSV file to see which Pokémon are legendary and have a Speed attribute above the median.

In [21]:
pokemon_legendaries = pd.read_csv('datasets/pokemon.csv')
pokemon_legendaries.head()
#pokemon_legendaries.Legendary.unique()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False


We can see that there is a column named 'Legendary' containing *False* values. At this stage, however, we don't know whether it also contains missing or other values, so we will use the **.unique()** method to display the unique values present in the Legendary column.

In [22]:
pokemon_legendaries.Legendary.unique()

array([False,  True])
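Since the column only holds *True* and *False*, it can be used directly as a boolean mask. The following is a sketch of the filter described above, not the notebook's own code: it uses a tiny made-up stand-in frame instead of `datasets/pokemon.csv` (the names and numbers are placeholders), and it assumes "the median" means the median Speed over all rows.

```python
import pandas as pd

# Hypothetical stand-in rows; placeholder names and values, not real data.
pokemon = pd.DataFrame({
    'Name': ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'],
    'Speed': [45, 60, 130, 85, 100],
    'Legendary': [False, False, True, True, False],
})

# Median Speed over every row (assumption: not just the legendary ones).
median_speed = pokemon['Speed'].median()

# Combine two conditions with &; the comparison needs its own parentheses.
fast_legendaries = pokemon[pokemon['Legendary'] & (pokemon['Speed'] > median_speed)]

print(median_speed)
print(fast_legendaries)
```

With the real dataset, the same two lines would be applied to `pokemon_legendaries` in place of the stand-in frame.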