# Data Science for Absolute Beginners

## Basic Transformations on DataFrames


Our approach in this section is to use, ironically enough, R as a starting point. In R, dplyr is a "grammar" of data manipulation: a consistent set of verbs that help you solve the most common data manipulation challenges. The dplyr transformations on a dataframe are:

- **mutate()** adds new variables that are functions of existing variables.
- **select()** picks variables based on their names.
- **filter()** picks cases based on their values.
- **summarise()** reduces multiple values down to a single summary.
- **arrange()** changes the ordering of the rows.

Our starting point is a dataframe, and dplyr can be thought of as five basic operations on it. In the sections that follow we carry out each of these operations with pandas.



![title](images/dplyr.png)

In [1]:
import pandas as pd
import seaborn as sns

sns.set_style('darkgrid')

## 0. Load DataFrame

### Source:

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.


### Data Set Information:

This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".

"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)

### Attribute Information:

0. mpg: continuous
1. cylinders: multi-valued discrete
2. displacement: continuous
3. horsepower: continuous
4. weight: continuous
5. acceleration: continuous
6. model year: multi-valued discrete
7. origin: multi-valued discrete
8. car name: string (unique for each instance)




In [10]:
# load dataset

In [11]:
# check first few records

In [12]:
# check dataframe info

In [13]:
# dataframe describe statistics
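
The loading and inspection cells above could be filled in roughly as follows. This is only a sketch: it assumes the seaborn copy of the dataset (whose string `origin` labels match the `"usa"` filter used later), and to stay runnable offline it builds a handful of made-up rows in the same layout instead of downloading the real data. The car names `car_a` to `car_f` and all values are hypothetical.

```python
import pandas as pd

# Assumption: in the notebook the full dataset would be loaded with seaborn, e.g.
#   df = sns.load_dataset("mpg")   # downloads the data, needs network access
# Here we build a few made-up rows with a subset of the same columns.
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "horsepower": [130.0, 165.0, 95.0, 46.0, 67.0, 88.0],
    "weight": [3504, 3693, 2372, 1835, 1950, 2130],
    "acceleration": [12.0, 11.5, 15.0, 20.5, 19.0, 14.5],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

first_rows = df.head(3)   # first few records
print(first_rows)

df.info()                 # prints column dtypes and non-null counts

stats = df.describe()     # summary statistics for the numeric columns only
print(stats)
```

Note that `describe()` skips the text columns (`name`, `origin`) by default, so only the numeric columns appear in the summary.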

## 1. Select

Select means choosing a subset of the columns.

In [14]:
# select individual column


In [15]:
# select multiple columns


In [16]:
# create a new dataframe mpg based on the selection


In [17]:
# alternative using loc
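
One possible way to fill in the select cells is sketched here. The rows are made up for illustration, and the choice of columns kept in the `mpg` dataframe is an assumption, since the notebook does not say which ones it uses.

```python
import pandas as pd

# Made-up rows in the layout of the mpg dataset (hypothetical values).
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "weight": [3504, 3693, 2372, 1835, 1950, 2130],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

# Select an individual column: a single label returns a Series.
mpg_col = df["mpg"]

# Select multiple columns: a list of labels returns a DataFrame.
subset = df[["name", "mpg"]]

# Create a new dataframe from the selection; .copy() makes it independent of df.
mpg = df[["name", "mpg", "origin"]].copy()

# Alternative using loc: ":" keeps all rows, the list picks the columns.
mpg_loc = df.loc[:, ["name", "mpg", "origin"]]

print(mpg_loc)
```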


## 2. Filter

Filter means keeping only the rows of the dataframe that satisfy some criterion.

- Step 1. Build the query, e.g. ``df['origin'] == "usa"``
- Step 2. Index the dataframe with the query: ``df[query]``

In [30]:
# filter for usa


In [18]:
# create a new dataframe usa based on the filter

In [19]:
# check unique values in origin column
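
The two filter steps and the unique-value check might look like this in practice. It is a minimal sketch on made-up rows; the values are hypothetical and only the `origin` labels follow the `"usa"` convention used in the text.

```python
import pandas as pd

# Made-up rows in the layout of the mpg dataset (hypothetical values).
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "weight": [3504, 3693, 2372, 1835, 1950, 2130],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

# Step 1: build the query. The comparison yields a boolean Series, one value per row.
query = df["origin"] == "usa"

# Step 2: index the dataframe with the query to keep only the rows marked True.
usa = df[query].copy()   # .copy() gives an independent dataframe
print(usa)

# Check which distinct values occur in the origin column.
origins = df["origin"].unique()
print(origins)
```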

In [20]:
# compound queries
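
Compound queries combine several boolean conditions. The sketch below, again on made-up rows, uses `&` for "and", `|` for "or", and `isin` for membership; the specific thresholds are chosen only for illustration. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

# Made-up rows in the layout of the mpg dataset (hypothetical values).
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "weight": [3504, 3693, 2372, 1835, 1950, 2130],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

# AND: cars from the usa that weigh less than 3600.
light_usa = df[(df["origin"] == "usa") & (df["weight"] < 3600)]

# OR: cars that are either from europe or reach more than 30 mpg.
europe_or_efficient = df[(df["origin"] == "europe") | (df["mpg"] > 30)]

# Membership: cars whose origin is one of several values.
japan_or_europe = df[df["origin"].isin(["japan", "europe"])]

print(light_usa)
print(europe_or_efficient)
print(japan_or_europe)
```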

## 3. Arrange

Arrange means sorting the rows by the values of one or more columns.

In [21]:
# 5 lightest cars; all japanese cars?


In [22]:
# 5 heaviest cars; all usa ?
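
Sorting in pandas is done with `sort_values`. The sketch below uses made-up rows, so it takes the top 3 rather than the top 5 and cannot answer the questions posed in the cell comments; on the full dataset the same calls with `head(5)` would apply.

```python
import pandas as pd

# Made-up rows in the layout of the mpg dataset (hypothetical values).
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "weight": [3504, 3693, 2372, 1835, 1950, 2130],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

# Lightest cars: sort ascending by weight (the default) and keep the first rows.
lightest = df.sort_values("weight").head(3)

# Heaviest cars: sort descending by weight.
heaviest = df.sort_values("weight", ascending=False).head(3)

# Sort by several columns: origin alphabetically, then mpg from high to low.
by_origin_mpg = df.sort_values(["origin", "mpg"], ascending=[True, False])

print(lightest[["name", "weight", "origin"]])
print(heaviest[["name", "weight", "origin"]])
```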


## 4. Mutate

In [11]:
# create a new column weight/mpg


In [23]:
# check dataframe
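
A new column can be created by assigning to a new label or with `assign`. In this sketch the column name `weight_per_mpg` is a hypothetical choice, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows in the layout of the mpg dataset (hypothetical values).
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "weight": [3504, 3693, 2372, 1835, 1950, 2130],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

# Option 1: assign() returns a new dataframe and leaves df unchanged,
# which is the closest match to dplyr's mutate().
with_ratio = df.assign(weight_per_mpg=lambda d: d["weight"] / d["mpg"])

# Option 2: assigning to a new label adds the column to df in place.
df["weight_per_mpg"] = df["weight"] / df["mpg"]

# Check the dataframe: the new column appears as the last one.
print(df.head())
```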

## 5. Summarise

In [24]:
# groupby based on origin


In [25]:
# check mean of mpg by origin

In [26]:
# check mean of mpg, horsepower, and acceleration by origin
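
The pandas counterpart of `summarise()` is usually a `groupby` followed by an aggregation. The following is a sketch on made-up rows, so the resulting means are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows in the layout of the mpg dataset (hypothetical values).
df = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "mpg": [18.0, 15.0, 24.0, 26.0, 31.0, 27.0],
    "horsepower": [130.0, 165.0, 95.0, 46.0, 67.0, 88.0],
    "acceleration": [12.0, 11.5, 15.0, 20.5, 19.0, 14.5],
    "origin": ["usa", "usa", "japan", "europe", "japan", "japan"],
})

# Group the rows by origin; nothing is computed until we aggregate.
by_origin = df.groupby("origin")

# Mean of mpg per origin: one column in, one Series out.
mean_mpg = by_origin["mpg"].mean()

# Mean of several columns per origin: a list of columns gives a DataFrame.
means = by_origin[["mpg", "horsepower", "acceleration"]].mean()

print(mean_mpg)
print(means)
```

The group labels become the index of the result, sorted alphabetically by default.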