# Pandas Guided Practice

***

### GP Goals

For this Guided Practice we will work with the Titanic dataset, in this instance provided by GitHub user NotAyushXD.

Their profile can be found here: https://github.com/NotAyushXD

Our goal is to explore basic to intermediate pandas features, including:

- Variable Types
- Data Frames
- Data Inspection
- Data Visualization

## Import pandas

In [1]:
# First import pandas with alias


# Import dataset from url
url = "https://raw.githubusercontent.com/NotAyushXD/Titanic-dataset/master/train.csv"


One of the most important first steps in data cleaning and exploration is to inspect the variable types and identify the relevant variables. When working with pandas dataframes, an efficient way to inspect variables is the `.head()` method, which returns the first rows of a dataset (five by default).
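As a minimal sketch of the import and inspection steps, the block below reads a few rows from an in-memory CSV instead of the real file. The rows in `csv_text` are hypothetical sample data in the Titanic column layout, and the names `sample` and `first_five` are chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows in the Titanic column layout (not the real file).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.2833
3,1,3,female,26,7.925
4,1,1,female,35,53.1
5,0,3,male,,8.05
6,0,3,male,54,51.8625
"""

# pd.read_csv accepts a URL, a file path, or a file-like object.
# In the practice itself you would pass `url` instead of this buffer.
sample = pd.read_csv(io.StringIO(csv_text))

# .head() returns the first five rows by default; .head(n) returns the first n.
first_five = sample.head()
print(first_five)
```

Calling `sample.head(3)` would return only the first three rows, which can be handy for very wide dataframes.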

In [2]:
# View first five rows of titanic dataframe


When working with pandas dataframes, and datasets in general, it is important to make sure that each variable is stored with an appropriate data type.

This ensures that the operations we want to run on a variable are actually supported.
We can use the `.dtypes` attribute to view the data types associated with the variables in our dataframe.

Generally speaking:

| Variable Type | Typical Data Types |
|---------------|--------------------|
| Continuous    | `float`            |
| Discrete      | `int`              |
| Binary        | `bool`,`str`,`int` |
| Nominal       | `str`,`int`        |
| Ordinal       | `str`,`int`        |
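The following sketch shows what `.dtypes` reports for a small dataframe. The dataframe `sample` and its values are assumptions made for illustration; the real Titanic dataframe has more columns.

```python
import pandas as pd

# Toy dataframe (hypothetical values) with one column per common type.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],                   # binary, stored as integers
    "Sex": ["male", "female", "female"],     # nominal, stored as strings
    "Age": [22.0, 38.0, None],               # continuous, with one missing value
    "Fare": [7.25, 71.2833, 7.925],          # continuous
})

# .dtypes is an attribute (no parentheses) returning one dtype per column.
col_types = sample.dtypes
print(col_types)
```

Depending on the pandas version, string columns are reported as `object` or as a dedicated string dtype. Note that `Age` is stored as a float because a missing value (`NaN`) cannot be represented in a plain integer column.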

In [3]:
# View data types of each variable in titanic dataframe


pandas provides a dedicated data type for categorical variables, `category`, which stores the set of category labels together with the values (other languages have comparable data types, such as `factor` in R). The `pandas.Categorical` constructor lets us convert categorical data to this data type.

Let's explore assigning a ranking to the `Pclass` variable. First we can use the `.unique()` method to view the unique values within `Pclass`.
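As a small sketch of both methods, the block below uses a hypothetical `pclass` Series rather than the real column.

```python
import pandas as pd

# Hypothetical passenger classes standing in for titanic["Pclass"].
pclass = pd.Series([3, 1, 3, 1, 3, 2], name="Pclass")

# .unique() returns the distinct values in order of first appearance.
distinct = pclass.unique()

# .nunique() returns how many distinct (non-null) values there are.
n_distinct = pclass.nunique()

print(distinct)
print(n_distinct)
```

Because `.unique()` preserves order of appearance rather than sorting, the result here starts with 3, not 1.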

## `.unique()` / `.nunique()`

In [4]:
# View all unique values used in Pclass


In [5]:
# View count of unique values used in Pclass


## `pd.Categorical()`

Let's assume the order of ranking among the values in `Pclass` is 3 < 2 < 1, where 1 represents 1st class, 2 represents 2nd class, and 3 represents 3rd class.

To make pandas aware of this ranking, we can use `pandas.Categorical()` and set an order on the values in the variable. Let's apply it to the `Pclass` variable:
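One possible way to express that ranking is sketched below on a toy dataframe. The dataframe `sample` is hypothetical; `categories` and `ordered` are real parameters of `pd.Categorical`.

```python
import pandas as pd

# Toy stand-in for the titanic dataframe.
sample = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# List the categories from lowest to highest rank and mark them as ordered.
sample["Pclass"] = pd.Categorical(
    sample["Pclass"],
    categories=[3, 2, 1],
    ordered=True,
)

print(sample["Pclass"].dtype)           # the column is now of dtype category
print(sample["Pclass"].cat.categories)  # categories in ranked order
print(sample["Pclass"].max())           # highest-ranked class under 3 < 2 < 1
```

With the order set, comparisons and reductions such as `.min()` and `.max()` follow the ranking rather than the numeric value, so the maximum here is 1st class.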

In [6]:
# Give a ranking to the Pclass variable


Now, when we apply the `.unique()` method to `Pclass`, pandas also reports the ranking within the variable:

In [7]:
# Check the order of the Pclass variable


If we use the `.dtypes` attribute again, we can see that the `Pclass` variable has a new data type as well: `category`.

In [8]:
# Review the data types of the titanic dataframe


With an order to the `Pclass` variable set, we can also use the `.sort_values()` method to order the rows within our dataframe based on the ranking within `Pclass`:
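A minimal sketch of sorting by an ordered categorical follows. The dataframe and the `Name` labels are hypothetical.

```python
import pandas as pd

# Toy stand-in for the titanic dataframe.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [1, 3, 2, 3],
})
sample["Pclass"] = pd.Categorical(
    sample["Pclass"], categories=[3, 2, 1], ordered=True
)

# Ascending sort follows the category order (3 < 2 < 1), not numeric order.
by_class = sample.sort_values("Pclass")
print(by_class)
```

Passing `ascending=False` would reverse this, listing 1st class passengers first.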

In [9]:
# Sort values in titanic based on Pclass values with order applied


When importing and exporting datasets, variables are sometimes assigned data types that don't make sense given how they are meant to be interpreted or used.

For example, if age (a continuous variable) is represented as a character string, it will be impossible to perform numerical operations on it.

In this case, it is important to alter data types so that models and operations can be applied in a sensible way.

Here we can use the `.astype()` method to change the data type of the variable. Let's use `.astype()` on the `Age` variable to change it from a `float` to an `int` data type:


In [10]:
# Change data type of Age variable to integer
# titanic["Age"].astype(int)

Because `Age` contains missing values, pandas is not able to convert its data type to integer. To resolve this, let's first take a look at the values within the variable:
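The failure can be reproduced on a small Series, as sketched below. The `age` values are hypothetical; the sketch catches `ValueError`, which is the error family pandas raises for this conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value, standing in for titanic["Age"].
age = pd.Series([22.0, np.nan, 26.0], name="Age")

try:
    age.astype(int)
    converted = True
except ValueError as err:
    # NaN has no integer representation, so the cast is refused.
    converted = False
    print(f"Conversion failed: {err}")
```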

In [11]:
# View unique values in Age variable


For this lab, we can handle the missing values with the `.fillna()` method. Here we will use forward fill, which propagates the last non-null value forward into any missing values that follow it:
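A short sketch of forward filling on hypothetical values is shown below. It uses `Series.ffill()`, since passing `method="ffill"` to `.fillna()` is deprecated in recent pandas releases; on older versions the two spellings behave the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps, standing in for titanic["Age"].
age = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 54.0], name="Age")

# Each NaN takes the most recent non-null value above it.
filled = age.ffill()
print(filled.tolist())
```

One caveat: forward fill has nothing to copy if the very first value is missing, so a leading `NaN` would remain after this step.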

In [12]:
# Fill in null values in Age variable


We can check that there are no missing values by using the `.isna()` method on the variable chained with the `.any()` method:
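As a sketch on hypothetical values, the same check can be run before and after filling:

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for titanic["Age"].
age = pd.Series([22.0, np.nan, 26.0], name="Age")

# .isna() marks each missing entry as True; .any() reports whether at least one is True.
has_missing_before = age.isna().any()
has_missing_after = age.ffill().isna().any()

# .sum() on the boolean mask counts the missing entries instead.
n_missing = age.isna().sum()

print(has_missing_before, has_missing_after, n_missing)
```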

In [13]:
# Check for null values in Age variable


Great, now that we know there are no longer any null values, we should be able to change the data type:
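The conversion itself might look like the sketch below, again on hypothetical values. It also illustrates that `.astype(int)` drops the fractional part rather than rounding.

```python
import pandas as pd

# Hypothetical, already-filled ages; two of them are fractional.
age = pd.Series([22.0, 22.0, 26.5, 0.42], name="Age")

# With no NaN present, the cast succeeds; decimals are truncated toward zero.
age_int = age.astype(int)
print(age_int.tolist())
```

If rounding is preferred over truncation, calling `.round()` before `.astype(int)` is one option.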

In [14]:
# Change data type of Age variable to integer


Now our variable contains only whole numbers to represent the age of each passenger on the Titanic!

With this, we could plot the `Age` variable against another variable in our DataFrame.
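One way to do this with pandas alone is `DataFrame.plot.scatter`, sketched below on hypothetical values. Pairing `Age` with `Fare` is an assumption made for illustration, and pandas relies on matplotlib being installed for plotting.

```python
import matplotlib

# Non-interactive backend so the sketch runs without a display;
# this line is not needed inside a notebook.
matplotlib.use("Agg")

import pandas as pd

# Toy stand-in for the titanic dataframe.
sample = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 51.8625],
})

# x and y take column names; the call returns a matplotlib Axes object.
ax = sample.plot.scatter(x="Age", y="Fare", title="Fare by age (toy data)")
```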

In [15]:
# create a scatterplot with Age and another variable
# using only pandas


In [16]:
# import seaborn and matplotlib 


In [17]:
# create a pairplot with seaborn


In [18]:
# create a heatmap
# to view correlation between variables


## `.groupby()`

We can group our data by variables using the `.groupby()` method. 

In SQL, if we wanted to know how many males and how many females survived, we would write something like this:
    
      SELECT Survived, Sex, count(*)
      FROM titanic
      GROUP BY Survived, Sex
      ORDER BY Survived, Sex;   
      
With pandas, we can invoke this functionality with `.groupby()`.
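A rough pandas counterpart of that query is sketched below on hypothetical rows. It uses `.size()`, which counts rows per group in the same way `count(*)` does.

```python
import pandas as pd

# Toy stand-in for the titanic dataframe.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "female"],
})

# Group by both columns and count the rows in each group.
# Group keys are sorted by default, which mirrors the ORDER BY clause.
counts = sample.groupby(["Survived", "Sex"]).size()
print(counts)
```

By contrast, `.count()` would return the number of non-null values in each remaining column, which can differ from the row count when data is missing.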

In [19]:
# group the data by the survived and sex variables
# returning the count 

# check variable
