# <font color='blue'> Exploratory Data Analysis with Python</font>

### Lise Doucette, Data and Statistics Librarian
### Nich Worby, Government Information and Statistics Librarian
### mdl@library.utoronto.ca

# <font color='blue'> Outline </font>



## <font color='blue'> 1 Overview</font>

## <font color='blue'> 2 Import libraries and import your data </font>

## <font color='blue'> 3 Viewing your Data </font>

## <font color='blue'> 4 Selecting and filtering your data </font>

## <font color='blue'> 5 Create crosstabs and grouping data </font>

## <font color='blue'> 6 Editing data / creating new fields </font>

## <font color='blue'> 7 Getting help </font>

---

## <font color='blue'> 2. Import packages/libraries and import your data</font>


### a) Import packages/libraries

Things to consider:
- which functionality you need (for example, pandas for working with tabular data)
- depending on how Python is set up on your computer, you may need to install the libraries first using [Anaconda Navigator](https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/), [conda](https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/), or the [command line](https://packaging.python.org/tutorials/installing-packages/)
- choosing a nickname/short name for libraries that you will be referring to later (some have standard nicknames, such as pd for pandas)
- syntax for importing packages/libraries: __import packagename as nickname__

In [None]:
import pandas as pd  # pd is the standard nickname for pandas

### b) Import existing data

Things to consider:
- where the data is stored
    - same folder as your Jupyter notebook or Python file?  don't need to specify the path
    - different folder?  need to specify path
- file type of data (csv, excel, text, other) and whether you might need a package to help you read the data
- how the data is separated (comma, space, semicolon, other)
- is there a header row with variable names?
- pandas guesses the data type of each column when it reads your data
    - common types: int64 (whole numbers), float64 (decimal numbers), object (usually text), bool (True/False)
- in pandas, your data is stored in a DataFrame

[Importing Data cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Cheat+Sheets/Importing_Data_Python_Cheat_Sheet.pdf)
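
A minimal sketch of reading comma-separated data with pandas. To keep the example self-contained, it reads from a string in memory instead of a file, and the rows are made up for illustration. With a real file you would pass its name or path instead, for example a hypothetical `pd.read_csv('titanic.csv')`.

```python
import io
import pandas as pd

# Made-up sample standing in for a CSV file on disk
csv_text = """pclass,survived,name,age
1,1,Passenger A,29
3,0,Passenger B,
2,1,Passenger C,41.5
"""

# sep="," is the default separator; header=0 means the first row holds the variable names
titanic = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

print(type(titanic))   # a pandas DataFrame
print(titanic.dtypes)  # the type pandas guessed for each column
```

Note that the empty age for Passenger B is read as a missing value, which is why pandas stores age as a decimal type.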

## <font color='blue'>3. Viewing your Data</font>


### a)  View the first few rows of data

Things to note:
- first row is Row 0
- you can indicate how many rows you want to see by including a number in parentheses (default is 5)
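
A short sketch of viewing the first rows with `.head()`, using a small made-up data frame in place of the real data set (the names and ages are placeholders):

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "name": [f"Passenger {i}" for i in range(8)],
    "age": [29, 41, 2, 35, 50, 18, 63, 24],
})

first_five = titanic.head()    # default: Rows 0 to 4
first_three = titanic.head(3)  # Rows 0 to 2

print(first_three)
```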

### b) Variable/column names and types

Things to consider:
- did pandas guess correctly about the type of data? What can you do if it didn't?
    - use __.astype()__
    - code: __titanic['ColumnName'] = titanic['ColumnName'].astype('NewDataType')__
- you can also see just a list of column names with the code: __titanic.columns__
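
As a rough illustration of checking and changing types, the sketch below uses a made-up data frame. Treating pclass as a category is one possible choice, assumed here for the example, not necessarily what your analysis needs.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "pclass": [1, 3, 2],
    "survived": [1, 0, 1],
    "fare": [211.34, 7.25, 26.0],
})

print(titanic.dtypes)  # pclass and survived are int64, fare is float64

# pclass is a label rather than a quantity, so store it as a category
titanic["pclass"] = titanic["pclass"].astype("category")

print(titanic.dtypes)
print(list(titanic.columns))
```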

### c) Missing/null data

Things to consider:
- why is the data "missing"?
- how will missing data affect your analyses? What can you do to address this?
    - need to know when pandas includes/excludes null values
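
One way to count missing values, and to see a case where pandas leaves them out by default, sketched on made-up data:

```python
import pandas as pd

# Made-up rows for illustration only; None marks a missing value
titanic = pd.DataFrame({
    "age": [29.0, None, 41.0, None],
    "embarked": ["S", "C", None, "S"],
})

# Number of null values in each column
missing_per_column = titanic.isnull().sum()
print(missing_per_column)

# value_counts() excludes nulls unless you ask for them
print(titanic["embarked"].value_counts())
print(titanic["embarked"].value_counts(dropna=False))
```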

### d) View a summary of your variables

Things to consider:
- what kinds of summary measures are meaningful for different variable types?
- how is the mean value of age calculated?
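
A small sketch of `.describe()` on made-up data. It also shows that the mean skips null values by default, so it is calculated over the passengers whose age is known.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0],
    "sex": ["female", "male", "male", "female"],
})

print(titanic.describe())               # numeric columns only, by default
print(titanic.describe(include="all"))  # also summarizes text columns

# Nulls are skipped: (20 + 30 + 40) / 3
mean_age = titanic["age"].mean()
print(mean_age)
```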

### e) View more meaningful summary data for categorical data
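
For categorical data, a count of each category is usually more informative than a mean. A minimal sketch with `.value_counts()` on made-up embarkation codes:

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({"embarked": ["S", "C", "S", "Q", "S", "C"]})

counts = titanic["embarked"].value_counts()                # number of rows per code
shares = titanic["embarked"].value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```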

### f) Create a bar plot for categorical data
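
One possible approach, assuming matplotlib is installed: count the categories first, then plot the counts. The data below is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only
titanic = pd.DataFrame({"sex": ["female", "male", "male", "female", "male"]})

# Count each category, then draw one bar per category
ax = titanic["sex"].value_counts().plot(kind="bar")
ax.set_xlabel("sex")
ax.set_ylabel("number of passengers")
```

In a Jupyter notebook the plot normally appears below the cell; in a Python script you would typically call `plt.show()` at the end.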

### g) Create a histogram for continuous, numerical data
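
A sketch of a histogram, again assuming matplotlib is installed and using made-up ages. The number of bins is an arbitrary choice for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only
titanic = pd.DataFrame({"age": [2, 18, 22, 25, 29, 31, 35, 41, 50, 63]})

# Group the ages into 5 equal-width bins and draw one bar per bin
ax = titanic["age"].plot(kind="hist", bins=5)
ax.set_xlabel("age")
```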

## Exercise

1. What was the most common age of passengers on the Titanic? Hint: use value_counts()
2. Create a bar plot of passenger class

## <font color='blue'>4. Selecting and filtering your data</font>

Things to note:
- syntax differences when selecting [one] vs [[multiple]] columns
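
The difference between single and double brackets, sketched on made-up data:

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C"],
    "age": [29, 41, 2],
    "fare": [80.0, 7.25, 13.0],
})

one_column = titanic["age"]             # single brackets: returns a Series
two_columns = titanic[["name", "age"]]  # double brackets (a list of names): returns a DataFrame

print(type(one_column))
print(type(two_columns))
```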

### a) Select/view one column

### b) Select/view multiple columns

### c) Select/view specific rows and columns

Note: selecting by label (for example with __.loc__) will return values at the specific rows and columns that you specify, and the end of a range is included (so rows 0:10 will show 11 rows, from Row 0 to Row 10)
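
A sketch of this behaviour using `.loc`, which selects by row label and column name. The data is made up, and the rows carry the default labels 0, 1, 2, and so on.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "name": [f"Passenger {i}" for i in range(15)],
    "age": list(range(20, 35)),
    "fare": [10.0] * 15,
})

# Row labels 0 through 10 (end point included), and two named columns
subset = titanic.loc[0:10, ["name", "age"]]
print(len(subset))  # 11
```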

### d) Select/view rows and columns by range of indices

Note: selecting by position (for example with __.iloc__) will return values at the ranges of rows and columns that you specify. As with other ranges in Python, this means from the lower index to one less than the higher index (so rows 0:10 will show 10 rows, from Row 0 to Row 9, and columns 0:3 will show 3 columns, from Column 0 to Column 2)
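
A sketch of this behaviour using `.iloc`, which selects by integer position. The data is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "name": [f"Passenger {i}" for i in range(15)],
    "age": list(range(20, 35)),
    "fare": [10.0] * 15,
    "pclass": [3] * 15,
})

# Row positions 0 to 9 and column positions 0 to 2 (end points excluded)
subset = titanic.iloc[0:10, 0:3]
print(subset.shape)  # (10, 3)
```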

## Exercise

1. Show the name and home destination columns for the final 10 rows of the data set.

### e) Select/view data that meets certain conditions (filters)
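
A filter is a condition that gives True or False for each row; placing it in square brackets keeps only the True rows. A minimal sketch on made-up data, where the fare threshold is an arbitrary choice for the example:

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "sex": ["female", "male", "female", "male"],
    "fare": [80.0, 7.25, 13.0, 120.5],
})

high_fare = titanic["fare"] > 50  # a Series of True/False, one value per row
print(high_fare)

print(titanic[high_fare])         # only the rows where the condition is True
```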

### f) Number of rows that meet your conditions
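
Two common ways to count the rows that meet a condition, sketched on made-up data:

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({"fare": [80.0, 7.25, 13.0, 120.5]})

high_fare = titanic["fare"] > 50

n_rows = len(titanic[high_fare])  # number of rows in the filtered data frame
n_true = high_fare.sum()          # True counts as 1, so the sum gives the same count

print(n_rows, n_true)
```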


### g) Combine multiple filters

Note: Combine filters with & (this means AND) or | (this means OR), and wrap each condition in its own parentheses
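
A sketch of combining two conditions on made-up data. Each condition sits in its own parentheses because & and | are evaluated before comparisons such as == and >.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "sex": ["female", "male", "female", "male"],
    "fare": [80.0, 7.25, 13.0, 120.5],
})

# AND: both conditions must be True
women_and_high_fare = titanic[(titanic["sex"] == "female") & (titanic["fare"] > 50)]

# OR: at least one condition must be True
women_or_high_fare = titanic[(titanic["sex"] == "female") | (titanic["fare"] > 50)]

print(women_and_high_fare)
print(women_or_high_fare)
```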

## Exercise

1. Create a filter that lists passengers who did not survive
2. Combine the filters we created earlier to create a list of passengers with the name Robert who survived
3. Create a filter that lists passengers in class 1 who were more than 30 years old
4. How many passengers fit the criteria from question 3?


## <font color='blue'>5 Creating crosstabs and grouping data</font>

### a) Create crosstabs

Things to think about:
- data types of variables you're interested in

Use the normalize argument to display crosstab values as proportions (multiply by 100 to get percentages).
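
A sketch of a crosstab with and without normalize, on made-up data. Here `normalize="index"` makes each row sum to 1; `"columns"` and `"all"` are the other options.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Rows are passenger classes, columns are survival outcomes
counts = pd.crosstab(titanic["pclass"], titanic["survived"])
row_shares = pd.crosstab(titanic["pclass"], titanic["survived"], normalize="index")

print(counts)
print(row_shares * 100)  # percentages within each passenger class
```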

Crosstabs aren't limited to comparing two variables at a time. Let's say we want to compare passenger class, sex and survival rates. We can use square brackets [ ] to pass a list of variables to the crosstab, similar to earlier examples.
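
A sketch of a three-variable crosstab on made-up data, with passenger class and sex as nested rows and survival as columns:

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "pclass":   [1, 1, 1, 3, 3, 3],
    "sex":      ["female", "male", "female", "female", "male", "male"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# A list in the first position creates one row per class/sex combination
table = pd.crosstab([titanic["pclass"], titanic["sex"]], titanic["survived"])
print(table)
```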

### b) Grouping Data 

- when does it make sense to use sum, mean, value_counts?
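
A sketch of `.groupby()` with the three summaries mentioned above, on made-up data. The pairing of each summary with a variable is an illustration: a mean suits a numeric variable, a sum of a 0/1 variable counts the 1s, and value_counts tallies categories.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({
    "pclass":   [1, 1, 3, 3, 3],
    "survived": [1, 0, 0, 1, 0],
    "fare":     [100.0, 60.0, 8.0, 7.0, 9.0],
})

mean_fare = titanic.groupby("pclass")["fare"].mean()                   # average fare per class
n_survivors = titanic.groupby("pclass")["survived"].sum()              # survivors per class
outcome_counts = titanic.groupby("pclass")["survived"].value_counts()  # tally of 0s and 1s per class

print(mean_fare)
print(n_survivors)
print(outcome_counts)
```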

### Exercise

1. Create a crosstab to show the numbers of men and women who survived.
2. Create a table to show the same data using groupby.

Which output is easier to read?

## <font color='blue'>6 Editing data and creating new fields</font>

Often variables in datasets use codes that aren't very descriptive. It's helpful to first view all codes in a variable before editing.

Next, read the codebook to understand what the codes mean. There are 3 codes for embarkation points: S = Southampton, C = Cherbourg and Q = Queenstown. Start the next line with the name of the variable you would like to edit, e.g. titanic['embarked']. 

Next, use the = sign to assign the result back to the variable. Without the assignment, the change is displayed but not saved. This is similar to value assignment in algebra, e.g. x = y + z. 

We can use the .replace( ) method to change our codes to names. We can use .value_counts( ) to check our work.
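
The steps above can be sketched as follows on made-up data, using the three embarkation codes from the codebook:

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({"embarked": ["S", "C", "Q", "S", None]})

# View all codes in the variable before editing
print(titanic["embarked"].unique())

# Replace each code with its name and assign the result back to the variable
titanic["embarked"] = titanic["embarked"].replace(
    {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
)

# Check our work
print(titanic["embarked"].value_counts())
```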

### Creating new variables

To create a new variable in a dataframe, call the dataframe by name, place the new variable name in square brackets and quotes, and assign a value with an equal sign, e.g. dataframe['new variable'] = value.

Let's say we want to calculate the fare variable in Canadian dollars. In 1912, the value of the Canadian dollar was pegged at 4.8666 CAD to one British pound sterling.

Check if the new variable has been added by using the .head( ) method.
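
A sketch of the conversion on made-up fares, using the exchange rate quoted above. The names `fare_cad` and `GBP_TO_CAD_1912` are hypothetical choices for this example.

```python
import pandas as pd

# Made-up rows for illustration only
titanic = pd.DataFrame({"fare": [7.25, 26.0, 211.5]})

GBP_TO_CAD_1912 = 4.8666  # rate quoted in the text

# The calculation is applied to every row of the fare column at once
titanic["fare_cad"] = titanic["fare"] * GBP_TO_CAD_1912

# The new variable appears as the last column
print(titanic.head())
```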

## Exercise

Create a new variable called 'is_child'. Filter the data for all passengers under the age of 18 and assign the results to the new variable. Check your new variable using .value_counts(). Next, do a crosstab to check survival rates for children vs. adults. **Bonus:** Add pclass to the crosstab to see how many children and adults in first, second, and third class survived or perished.