# Manipulating a dataset structure
Applying some operations to change a DataFrame's structure and filter its content

In [9]:
# 1-) Import Pandas
import pandas as pd

# 2-) Read `vehicles.csv`
df = pd.read_csv('./csv/vehicles.csv')
# 3-) Make some quick analysis
print(df.describe())
print(df.info())
df.head(5)

              Year  Engine Displacement     Cylinders  Fuel Barrels/Year  \
count  35952.00000         35952.000000  35952.000000       35952.000000   
mean    2000.71640             3.338493      5.765076          17.609056   
std       10.08529             1.359395      1.755268           4.467283   
min     1984.00000             0.600000      2.000000           0.060000   
25%     1991.00000             2.200000      4.000000          14.699423   
50%     2001.00000             3.000000      6.000000          17.347895   
75%     2010.00000             4.300000      6.000000          20.600625   
max     2017.00000             8.400000     16.000000          47.087143   

           City MPG   Highway MPG  Combined MPG  CO2 Emission Grams/Mile  \
count  35952.000000  35952.000000  35952.000000             35952.000000   
mean      17.646139     23.880646     19.929322               475.316339   
std        4.769349      5.890876      5.112409               119.060773   
min        

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550


## Check DataFrame Column Names

There are two ways to rename columns:
- `data.columns` is an **attribute** of the DataFrame that holds a list-like object with the column names
    - You can replace it with another list containing the names you want
    - Note that you have to replace the whole set of column names at once
    
- `data.rename()` is a **method** of a DataFrame, with which you can rename one or more columns at a time
    - You just need to pass a dictionary of the form {'old_name':'new_name'}
    - By default, it renames **index** labels (`axis=0`); pass `axis=1` (or use the `columns=` keyword) to rename **columns**
    - The `inplace` argument controls whether the DataFrame is modified directly (`inplace=True`) or a renamed copy is returned (`inplace=False`, the default)

### Substituting `.columns` attribute

In [11]:
# Example: Let's convert all columns to lowercase and put underscore instead of spaces
df.columns = [column.lower().replace(' ', '_') for column in df.columns]

In [13]:
df.highway_mpg

0        17
1        13
2        17
3        13
4        21
         ..
35947    38
35948    38
35949    38
35950    39
35951    39
Name: highway_mpg, Length: 35952, dtype: int64

### Using the `.rename()` method

`.rename(columns={'old_column':'new_column'})`

In [19]:
# Example: Let's rename `year` to `model_year`
# inplace=True modifies df directly, so the new name shows up in df.columns below
df.rename(columns = {'year': 'model_year'}, inplace = True)
df.columns

Index(['make', 'model', 'model_year', 'engine_displacement', 'cylinders',
       'transmission', 'drivetrain', 'vehicle_class', 'fuel_type',
       'fuel_barrels/year', 'city_mpg', 'highway_mpg', 'combined_mpg',
       'co2_emission_grams/mile', 'fuel_cost/year'],
      dtype='object')

In [21]:
# Example 2: Let's use dict comprehension to create a dict where we replace all column names
df.rename(columns = {column: column.upper() for column in df.columns}).head(2)

Unnamed: 0,MAKE,MODEL,MODEL_YEAR,ENGINE_DISPLACEMENT,CYLINDERS,TRANSMISSION,DRIVETRAIN,VEHICLE_CLASS,FUEL_TYPE,FUEL_BARRELS/YEAR,CITY_MPG,HIGHWAY_MPG,COMBINED_MPG,CO2_EMISSION_GRAMS/MILE,FUEL_COST/YEAR
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550


### Using the `.rename()` method with `inplace=True`

In [9]:
# Your code here!

What happens if you assign the result of a call that uses `inplace=True` to a variable?

In [10]:
# Your code here!
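
One possible answer to the question above, as a minimal sketch. The two-row frame `toy` is made up for illustration and is not the vehicles dataset.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration
toy = pd.DataFrame({"Make": ["Ford", "Dodge"], "Year": [1984, 1985]})

# With inplace=True the frame is modified directly and the call returns None
result = toy.rename(columns={"Make": "Manufacturer"}, inplace=True)

print(result)             # None
print(list(toy.columns))  # ['Manufacturer', 'Year']
```

So writing `data = data.rename(..., inplace=True)` would replace your DataFrame with `None`.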

So, we have two options:
> 1. Assign the result back to the variable `data`:

    data = data.rename(columns={'Make':'Manufacturer', 'Year':'ANO'})
> 2. Use the argument `inplace=True` to modify the dataframe directly:

    data.rename(columns={'Make':'Manufacturer', 'Year':'ANO'}, inplace=True)
    

## Reorder columns in a dataframe

Remember: to select several columns, you pass a list of column names to the dataframe. The order of that list determines the column order of the result.

In [22]:
# Just select the columns in a different order and overwrite the previous dataframe
import numpy as np

In [25]:
COLUMNS = np.sort(df.columns)
df[COLUMNS].head(2)

Unnamed: 0,city_mpg,co2_emission_grams/mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels/year,fuel_cost/year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
0,18,522.764706,17,4.0,2-Wheel Drive,2.5,19.388824,1950,Regular,17,AM General,DJ Po Vehicle 2WD,1984,Automatic 3-spd,Special Purpose Vehicle 2WD
1,13,683.615385,13,6.0,2-Wheel Drive,4.2,25.354615,2550,Regular,13,AM General,FJ8c Post Office,1984,Automatic 3-spd,Special Purpose Vehicle 2WD


In [27]:
df.sort_index(axis=1).head(2)

Unnamed: 0,city_mpg,co2_emission_grams/mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels/year,fuel_cost/year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
0,18,522.764706,17,4.0,2-Wheel Drive,2.5,19.388824,1950,Regular,17,AM General,DJ Po Vehicle 2WD,1984,Automatic 3-spd,Special Purpose Vehicle 2WD
1,13,683.615385,13,6.0,2-Wheel Drive,4.2,25.354615,2550,Regular,13,AM General,FJ8c Post Office,1984,Automatic 3-spd,Special Purpose Vehicle 2WD
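
Alphabetical sorting is only one possible order. Here is a minimal sketch of moving a chosen column to the front with an explicit list, assuming a small made-up frame `toy` whose column names mimic the ones above.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({
    "make": ["Ford", "Dodge"],
    "model_year": [1984, 1985],
    "city_mpg": [10, 8],
})

# Put `model_year` first and keep the remaining columns in their current order
first = ["model_year"]
new_order = first + [col for col in toy.columns if col not in first]

# Selecting with the new list and reassigning overwrites the previous frame
toy = toy[new_order]
print(list(toy.columns))  # ['model_year', 'make', 'city_mpg']
```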


## Remove column (or row)

- The `.drop()` method removes rows or columns
- By default, `.drop()` drops rows given their index labels; pass `columns=` (or `axis=1`) to drop columns instead.

In [28]:
# Your code here!
df.drop(columns = ['cylinders', 'drivetrain']).head(2)

Unnamed: 0,make,model,model_year,engine_displacement,transmission,vehicle_class,fuel_type,fuel_barrels/year,city_mpg,highway_mpg,combined_mpg,co2_emission_grams/mile,fuel_cost/year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,Automatic 3-spd,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,Automatic 3-spd,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
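
To compare the default row behaviour with column removal, here is a small sketch. The frame `toy` and its index labels are invented for the example.

```python
import pandas as pd

# Hypothetical frame with custom index labels, used only for illustration
toy = pd.DataFrame(
    {"make": ["Ford", "Dodge", "Audi"], "cylinders": [6, 8, 4]},
    index=[10, 11, 12],
)

# Default (axis=0): the argument is an index label, so a row is dropped
without_row = toy.drop(11)

# Two equivalent ways to drop a column
without_col = toy.drop(columns=["cylinders"])
same_result = toy.drop("cylinders", axis=1)

# .drop() returns a new frame; `toy` itself is unchanged
print(toy.shape)  # (3, 2)
```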


## Deep vs Shallow copy in pandas

In [31]:
# Your code here!
df2 = df  # Not a copy at all: df2 is just another name for the same object
df2 is df

True

In [32]:
df3 = df.copy()
df3 is df

False
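
The practical consequence can be sketched as follows, using a made-up one-column frame `toy`. The names `alias` and `independent` are illustrative only.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({"city_mpg": [18, 13]})

alias = toy               # plain assignment: both names point to the same object
independent = toy.copy()  # .copy() makes a deep copy by default (deep=True)

# Writing through the alias also changes `toy`, because it is the same object
alias.loc[0, "city_mpg"] = 99

print(toy.loc[0, "city_mpg"])          # 99
print(independent.loc[0, "city_mpg"])  # 18
```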

## Sort Values in a DataFrame

In [37]:
# Your code here!
df.sort_values(by=['model_year', 'highway_mpg']).head(2)

Unnamed: 0,make,model,model_year,engine_displacement,cylinders,transmission,drivetrain,vehicle_class,fuel_type,fuel_barrels/year,city_mpg,highway_mpg,combined_mpg,co2_emission_grams/mile,fuel_cost/year
9068,Dodge,B350 Wagon 2WD,1984,5.9,8.0,Automatic 3-spd,2-Wheel Drive,"Vans, Passenger Type",Regular,36.623333,8,10,9,987.444444,3700
11537,Ford,Bronco 4WD,1984,5.8,8.0,Automatic 3-spd,4-Wheel or All-Wheel Drive,Special Purpose Vehicle 4WD,Regular,32.961,10,10,10,888.7,3350
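
When sorting by several columns you may want a different direction for each one. A minimal sketch, assuming a made-up three-row frame `toy`:

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({
    "model_year": [1985, 1984, 1984],
    "highway_mpg": [17, 10, 21],
})

# Oldest year first; within the same year, highest highway_mpg first
ordered = toy.sort_values(
    by=["model_year", "highway_mpg"],
    ascending=[True, False],
)
print(ordered)
```

Like `.drop()`, `.sort_values()` returns a new frame and leaves `toy` in its original order.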


# Filter records
>    - `mask` concept
>    - `.query()` method

This is really important for data wrangling.

## Simple Example: Starting with a numpy array. How can I filter its values?

In [41]:
# Your code here!
array = np.array(range(1, 10, 2))
array

array([1, 3, 5, 7, 9])

The result of `array > 5` is what is called **a mask**: an array containing the `True` and `False` results of the comparison for each element.

In [45]:
array > 5

array([False, False, False,  True,  True])

In [46]:
array[array > 5]

array([7, 9])

In [15]:
# Your code here!

Masks can be used as an index to select data!

In [49]:
# Your code here!
df_2017 = df[df.model_year == 2017]
df_2017.model_year

46       2017
140      2017
141      2017
142      2017
143      2017
         ... 
35883    2017
35885    2017
35891    2017
35938    2017
35939    2017
Name: model_year, Length: 857, dtype: int64

In [52]:
(df.fuel_type == 'Premium').value_counts(normalize = True)

False    0.724049
True     0.275951
Name: fuel_type, dtype: float64

In [53]:
is_premium = (df.fuel_type == 'Premium')
df['is_premium'] = is_premium  # create a new column

In [54]:
df.is_premium

0        False
1        False
2        False
3        False
4         True
         ...  
35947     True
35948     True
35949     True
35950     True
35951     True
Name: is_premium, Length: 35952, dtype: bool

After selecting, you can do anything with the selection, for example assign new values to it. This is called a **vectorized** operation: it is applied to all selected rows at once.

In [17]:
# Your code here!
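
A possible sketch of assigning through a selection, using `.loc` with a mask. The frame `toy` and the threshold of 10 are invented for the example.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({"city_mpg": [18, 8, 13, 9]})

# Every row where the mask is True receives the new value in one step,
# with no explicit Python loop
toy.loc[toy.city_mpg < 10, "city_mpg"] = 10

print(list(toy.city_mpg))  # [18, 10, 13, 10]
```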

You can also save the condition

In [18]:
# Your code here!
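
Saving the condition in a variable lets you reuse it. A small sketch, assuming a made-up frame `toy`. The `~` operator, which inverts a mask, is an addition not covered in the text above.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({"model_year": [1984, 2017, 2017, 2001]})

# Store the mask once...
is_2017 = toy.model_year == 2017

# ...and reuse it: select matching rows, the opposite rows, or count matches
cars_2017 = toy[is_2017]
older_cars = toy[~is_2017]   # ~ flips True and False
n_2017 = is_2017.sum()       # True counts as 1

print(len(cars_2017), len(older_cars), n_2017)  # 2 2 2
```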

## Bitwise logical operators - Combining conditions

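To combine more than one condition, you can use
- `&` - analogous to `and`
- `|` - analogous to `or`

Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `<`.

For example, get all numbers from `array` that are greater than 3 and smaller than 8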

Let's do it in steps:
- get values greater than 3

In [19]:
# Your code here!

- get values smaller than 8

In [20]:
# Your code here!

- get values greater than 3 and smaller than 8

In [21]:
# Your code here!
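
The three steps above can be sketched like this, reusing the same `array` defined earlier in the notebook (it is redefined here so the snippet runs on its own):

```python
import numpy as np

array = np.array(range(1, 10, 2))   # array([1, 3, 5, 7, 9])

# Step 1: values greater than 3
greater_than_3 = array > 3

# Step 2: values smaller than 8
smaller_than_8 = array < 8

# Step 3: both conditions at once, element by element
both = greater_than_3 & smaller_than_8
print(array[both])  # [5 7]

# The same thing inline; note the parentheses around each comparison
print(array[(array > 3) & (array < 8)])  # [5 7]
```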

## Now in a DataFrame

Let's find the rows in which the `cylinders` value is exactly 6.

In [22]:
# Your code here!
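
One way this could look, shown on a made-up frame `toy` rather than the full dataset:

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({
    "make": ["Ford", "Dodge", "Audi"],
    "cylinders": [6.0, 8.0, 4.0],
})

# The comparison builds a boolean mask; indexing with it keeps the True rows
six_cylinders = toy[toy.cylinders == 6]
print(six_cylinders)
```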

### Using conditions to create new columns

In [23]:
# create a column 'is_premium' and set values equal to False
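
A minimal sketch of this two-step pattern, assuming a made-up frame `toy`: first fill the new column with a default, then overwrite only the rows that meet the condition.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({"fuel_type": ["Regular", "Premium", "Diesel"]})

# Step 1: create the column with the default value in every row
toy["is_premium"] = False

# Step 2: set True only where the condition holds
toy.loc[toy.fuel_type == "Premium", "is_premium"] = True

print(list(toy.is_premium))  # [False, True, False]
```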

## You can combine conditions

Cars made by `Ford` that have 6 `cylinders`

In [25]:
# Your code here!
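
A possible sketch of combining two conditions on a DataFrame. The frame `toy` is invented for the example.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({
    "make": ["Ford", "Ford", "Dodge"],
    "cylinders": [6, 8, 6],
})

# Both conditions must hold (&); each one sits in its own parentheses
ford_six = toy[(toy.make == "Ford") & (toy.cylinders == 6)]

# At least one condition must hold (|)
ford_or_six = toy[(toy.make == "Ford") | (toy.cylinders == 6)]

print(len(ford_six), len(ford_or_six))  # 1 3
```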

## Another way to do the same thing

* using the method `query`

The method `query` receives a string in which you state your condition. Important things:
- `.query()` is a method of your dataframe
- `.query()` receives a string
- Every word inside the string that is not quoted is treated as a variable name and looked up among the columns of your dataframe. For example, `.query('Year == 1999')` will look for the column `Year`. If you run `.query('Make == Ford')`, it will look for both a column named `Make` and a column named `Ford`. If you want the values of the column `Make` to match the **string** Ford, you have to run `.query('Make == "Ford"')`
- If your column name has spaces, you have to wrap it in backticks, as in **.query('\`Engine Displacement\` < 4')**:

In [27]:
# Your code here!
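
The points above can be sketched on a made-up frame `toy`, which keeps the original capitalised column names so the query strings match the ones in the text:

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy = pd.DataFrame({
    "Make": ["Ford", "Dodge", "Ford"],
    "Year": [1999, 1999, 2005],
    "Engine Displacement": [4.6, 3.9, 2.0],
})

# Unquoted word -> column name; number -> literal value
by_year = toy.query("Year == 1999")

# String values need their own quotes inside the query string
fords = toy.query('Make == "Ford"')

# Column names containing spaces go between backticks
small_engines = toy.query("`Engine Displacement` < 4")

# Inside a query string, conditions can be joined with `and` / `or`
fords_1999 = toy.query('Make == "Ford" and Year == 1999')
```

Each call returns the same rows you would get from the equivalent mask, for example `toy[toy.Year == 1999]`.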