# <font color='blue'> Exploratory Data Analysis with Python: Part 1 of 2</font>

### Lise Doucette, Data and Statistics Librarian
### Nich Worby, Government Information and Statistics Librarian
### mdl@library.utoronto.ca

# <font color='blue'> Outline </font>



## <font color='blue'> 1 Overview</font>
## <font color='blue'> 2 Importing libraries and reading data </font>
## <font color='blue'> 3 Getting help </font>
## <font color='blue'> 4 Viewing your Data </font>
## <font color='blue'> 5 Plotting/Graphing your Data</font>
## <font color='blue'> 6 Selecting and filtering your data </font>


---
In Part 2, we'll do a quick review of Part 1 and cover crosstabs, grouping data, editing variables, and creating new variables.

---

## <font color='blue'> 1 Overview</font>

- What is Python and why use it?
- Versions of Python
  - many older projects are written in 2.7, but most people new to Python learn version 3
  - can use the following code to determine what version your system is using:

~~~
from platform import python_version
print(python_version())
~~~

   
- How Python works - programming language, objects and methods, libraries, indentation/white space
- Indexing (starts at 0), rows (records) and columns (variables/attributes)
- Some options for using Python - through [Anaconda Navigator](https://www.anaconda.com/distribution/)* you can install JupyterLab, Jupyter Notebook, Spyder, or a console; [Google Colab](https://colab.research.google.com) is a cloud-based Jupyter notebook environment.
- Jupyter notebooks - cells of code and markdown; last line determines output of cell; running cells (changes from * to a number)


\* "Anaconda Navigator is a desktop graphical user interface included in Anaconda that allows you to launch applications and easily manage conda packages, environments and channels without the need to use command line commands."  "Anaconda® is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 1,500+ open source packages."


## <font color='blue'> 2 Importing libraries and reading data</font>


### a) Importing packages/libraries

Things to consider:
- functionality that you need 
- you may need to install the libraries first using [Anaconda Navigator](https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/), [conda](https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/), or the [command line](https://packaging.python.org/tutorials/installing-packages/)
- use a nickname/short name for libraries that you will be referring to later (there are some common/standard ones)
- syntax for importing packages/libraries:
~~~
import packagename as nickname
~~~
- for plotting in Jupyter notebooks, you need to add one more line to tell the notebook to display the plots directly in the notebook
~~~
%matplotlib inline
~~~
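
As a minimal sketch of the import syntax above, the lines below load two libraries under their conventional nicknames (`pd` for pandas, `np` for NumPy). The choice of libraries is an assumption for illustration; import whichever ones provide the functionality you need.

```python
import pandas as pd   # "pd" is the conventional nickname for pandas
import numpy as np    # "np" is the conventional nickname for NumPy

# The nickname becomes the prefix for everything the library provides.
pandas_version = pd.__version__
root = np.sqrt(16)

print(pandas_version)
print(root)
```

For plotting, the conventional import is `import matplotlib.pyplot as plt`.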

### b) Reading data

Things to consider:
- where the data is stored
    - same folder as your Jupyter notebook or Python file?  don't need to specify the path
    - different folder?  need to specify path
- file type of data (csv, excel, text, other) and whether you might need a package to help you read the data
- how the data is separated (comma, space, semicolon, other)
- is there a header row with variable names?
- pandas makes some guesses about your data format and type
    - int64, float64, object, bool
- in pandas, your data is stored in a data frame

[Importing Data cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Cheat+Sheets/Importing_Data_Python_Cheat_Sheet.pdf)
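
The considerations above can be sketched with `pd.read_csv`. So that the example runs on its own, it reads a few invented rows from a string instead of a file; the column names are assumptions, and with a real file you would pass its name or path instead.

```python
import io
import pandas as pd

# Stand-in for a file on disk: a few invented rows in CSV form
# (not real passenger records; column names are assumed).
csv_text = """name,sex,age,pclass,fare,survived
"Adams, Mr. Robert",male,34,1,80.0,0
"Baker, Mrs. Ann",female,28,2,26.0,1
"Clark, Miss. Jane",female,,3,7.75,1
"""

# sep is the separator; header=0 means the first row holds the variable names.
# With a real file: pd.read_csv("titanic.csv")  (hypothetical filename)
titanic = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

print(type(titanic))    # the data is stored in a DataFrame
print(titanic.shape)    # (number of rows, number of columns)
print(titanic.dtypes)   # the types pandas guessed for each column
```

For a file in a different folder, the path goes inside the quotation marks along with the file name.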

## <font color='blue'>3 Getting help</font>

- inline/in-program documentation
    
        help()
    
- official documentation - e.g., [Pandas](https://pandas.pydata.org/)
- 'unofficial' documentation, aka Googling and finding examples, e.g., searching for: python sort data
- cheat-sheets, e.g., [Wrangling Data with Pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- online guides/tutorials, e.g., [introtopython.org](http://introtopython.org/var_string_num.html)
- online courses (no fee), e.g., Python courses through [LinkedIn Learning](https://lnkd.in/gf85Mmv)
- online courses (fee), e.g., [Python for Data Science and AI](https://www.coursera.org/learn/python-for-applied-data-science-ai)
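
As a small sketch of the in-program option, `help()` can be pointed at a specific method. The method chosen here (`DataFrame.head`) is just an example.

```python
import pandas as pd

# Print the built-in documentation for one DataFrame method.
help(pd.DataFrame.head)

# The same text is stored on the method as its docstring.
doc = pd.DataFrame.head.__doc__
print(len(doc))
```

In a Jupyter cell, typing `pd.DataFrame.head?` displays the same documentation.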

## <font color='blue'>4 Viewing your Data</font>


### a)  View the first few rows of data

Things to note:
- first row is Row 0
- you can indicate how many rows you want to see by including a number in parentheses (default is 5)
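
A minimal sketch of viewing the first rows with `.head()`, using a small invented data frame (the rows and column names are assumptions, not real passenger records):

```python
import pandas as pd

# Toy rows invented for illustration.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis", "Evans", "Foster"],
    "age": [34.0, 28.0, None, 45.0, 19.0, 52.0],
})

first_five = titanic.head()     # default: first 5 rows (Row 0 to Row 4)
first_two = titanic.head(2)     # first 2 rows (Row 0 and Row 1)

print(first_five)
print(first_two)
```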

### b) Variable/column names and types

Things to consider:
- did pandas guess correctly about the type of data? What can you do if it didn't?  Indicate the type using __.astype__

~~~
titanic['ColumnName'] = titanic['ColumnName'].astype('NewDataType')
~~~
- common data types
  - int64 - integers (whole numbers)
  - float64 - decimal point numbers
  - object - text/string
  - bool - True/False value only
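
A short sketch of checking the guessed types with `.dtypes` and changing one with `.astype`. The data frame is invented, and treating `survived` as a True/False variable is an assumption made for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "pclass": [1, 2, 3],
    "fare": [80.0, 26.0, 7.75],
    "survived": [0, 1, 1],
})

print(titanic.dtypes)   # survived is guessed to be int64

# Tell pandas that survived is really a True/False variable.
titanic["survived"] = titanic["survived"].astype("bool")

print(titanic.dtypes)   # survived is now bool
```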

### c) Missing/null data

Things to consider:
- why is the data "missing"? (e.g., not available, not known, participant refused to provide it)
- how will missing data affect your analyses? What can you do to address this?
    - need to know when pandas includes/excludes null values
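
The sketch below counts null values per column and shows one case where pandas excludes them by default (the mean). The data is invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; one age is missing.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis"],
    "age": [34.0, 28.0, None, 45.0],
})

# Number of null values in each column.
missing_per_column = titanic.isnull().sum()
print(missing_per_column)

# By default, mean() skips nulls: (34 + 28 + 45) / 3
mean_skipping_nulls = titanic["age"].mean()

# With skipna=False the null is included and the result is NaN.
mean_with_nulls = titanic["age"].mean(skipna=False)

print(mean_skipping_nulls, mean_with_nulls)
```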

### d) View a summary of your variables

Things to consider:
- what kinds of summary measures are meaningful for different variable types?
- how is the mean value of age calculated?
- documentation for formatting of a command: [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
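
A minimal sketch of `describe()` on an invented data frame. By default only numeric columns are summarized; `include="all"` adds the other columns as well.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [34.0, 28.0, None, 45.0],
    "fare": [80.0, 26.0, 7.75, 52.0],
})

numeric_summary = titanic.describe()            # numeric columns only
all_summary = titanic.describe(include="all")   # every column

print(numeric_summary)
print(all_summary)
```

Note that the count for `age` is 3 rather than 4: the missing value is left out of the summary, including the mean.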

### e) View more meaningful summary data for categorical data
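
For categorical data, counts per category are usually more meaningful than a mean. One way to get them, sketched here on invented data, is `value_counts()`:

```python
import pandas as pd

# Toy rows invented for illustration; the column name is assumed.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "male"],
})

counts = titanic["sex"].value_counts()                      # most common first
proportions = titanic["sex"].value_counts(normalize=True)   # shares instead of counts

print(counts)
print(proportions)
```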

## <font color='blue'>5 Plotting/Graphing your Data</font>

Different types of plots are appropriate for different types of data.  We'll just explore a couple of types here.  For more information, check out the [Data Visualization Guide](https://mdl.library.utoronto.ca/dataviz/getting-started).

### a) Create a bar plot for categorical data
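
A sketch of one way to draw a bar plot: count the categories first, then plot the counts. The data and the column name are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # only needed outside a notebook; in Jupyter, %matplotlib inline takes its place
import pandas as pd

# Toy rows invented for illustration.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "male"],
})

# One bar per category, with bar height equal to the number of rows.
counts = titanic["sex"].value_counts()
ax = counts.plot(kind="bar")
ax.set_xlabel("sex")
ax.set_ylabel("number of passengers")
```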

### b) Create a histogram for continuous, numerical data
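
A sketch of a histogram for a numeric column. The ages are invented, and the number of bins (5) is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # only needed outside a notebook; in Jupyter, %matplotlib inline takes its place
import pandas as pd

# Toy values invented for illustration.
titanic = pd.DataFrame({
    "age": [34.0, 28.0, 45.0, 19.0, 52.0, 23.0],
})

# Group the ages into 5 equal-width bins and draw one bar per bin.
ax = titanic["age"].plot(kind="hist", bins=5)
ax.set_xlabel("age")
```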

## Exercise

1. What was the most common age of passengers on the Titanic? Hint: Use a function that gives you meaningful summary data
2. Create a bar plot of passenger class
3. Add a title to the plot created in exercise 2.  For help on how to do so, try Googling or using Python's help functions.

## <font color='blue'>6 Selecting and filtering your data</font>

Things to note:
- syntax differences when selecting [one] vs [[multiple]] columns

We're going to access columns by calling them by their name. It's helpful to get a list of all column names by using the .columns attribute (no parentheses needed).

### a) Select/view one column

In order to select a single column, use the following syntax:

    dataframe['column_name']

### b) Select/view multiple columns

The syntax for selecting multiple columns requires the use of two sets of brackets. Use the following syntax:
        
        dataframe[['col1','col2']]
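
The difference between the two syntaxes can be sketched on an invented data frame: single brackets return one column as a Series, double brackets return a smaller data frame.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark"],
    "age": [34.0, 28.0, None],
    "fare": [80.0, 26.0, 7.75],
})

print(titanic.columns)                 # list the column names first

one = titanic["name"]                  # one column  -> Series
several = titanic[["name", "fare"]]    # two columns -> DataFrame

print(type(one))
print(type(several))
```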

### c) Select/view specific rows and columns by label


Note: __loc__ returns values at the __specific__ row and column labels that you specify, and both ends of a range are included (so rows 0:10 will show 11 rows, from Row 0 to Row 10)

We are using the default row labels, which are the row numbers (0,1,2, etc.). If we were using names of individuals as row labels the same code might look like:
        
        titanic.loc['Allen':'Astor',['name','fare']]
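
Both cases can be sketched on a small invented data frame: first with the default row numbers as labels, then with names as row labels. Setting the labels with `set_index` is an assumption made for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis"],
    "age": [34.0, 28.0, None, 45.0],
    "fare": [80.0, 26.0, 7.75, 52.0],
})

# Default labels are the row numbers; 0:2 includes Row 2, so 3 rows come back.
by_number = titanic.loc[0:2, ["name", "fare"]]

# Use the names as row labels, then select a range of labels (both ends included).
named = titanic.set_index("name")
by_name = named.loc["Adams":"Clark", ["fare"]]

print(by_number)
print(by_name)
```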

### d) Select/view rows and columns by range of indexes/indices

The __iloc__ indexer selects by index position rather than by label. It returns values at the ranges of rows and columns that you specify.  In Python, a range runs from the __lower index to one less than the higher index__ (so rows 0:10 will show 10 rows, from Row 0 to Row 9, and columns 0:3 will show 3 columns, from Column 0 to Column 2)

With __iloc__, you can also choose specific columns by giving a list of column positions, which results in double brackets.
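
A minimal sketch of position-based selection on an invented data frame, showing a range of columns and then a list of specific column positions:

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis"],
    "age": [34.0, 28.0, None, 45.0],
    "pclass": [1, 2, 3, 1],
    "fare": [80.0, 26.0, 7.75, 52.0],
})

# Rows 0 to 1 and columns 0 to 2 (the upper ends, 2 and 3, are excluded).
by_range = titanic.iloc[0:2, 0:3]

# Rows 0 to 1 and only the columns at positions 0 and 3.
by_list = titanic.iloc[0:2, [0, 3]]

print(by_range)
print(by_list)
```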

## Exercise

1. Show the final 10 rows of the data set, and the name and age columns.

### e) Select/view data that meets certain conditions (filters)

You can create filters based on numeric conditions:
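
As a sketch on invented data (the fare threshold of 50 is arbitrary), a numeric condition produces a True/False value for every row, and placing it inside the brackets keeps only the True rows:

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis"],
    "fare": [80.0, 26.0, 7.75, 52.0],
})

is_expensive = titanic["fare"] > 50    # True/False for each row
expensive = titanic[is_expensive]      # keep only the True rows

print(is_expensive)
print(expensive)
```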


Filters can also be created for strings, which sometimes requires the use of [regular expressions](https://regex101.com/):
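
One way to build a string filter is `.str.contains`, sketched here on invented names. The second filter uses a regular expression in which `|` means "or" and `\.` matches a literal period.

```python
import pandas as pd

# Toy rows invented for illustration; the name format is assumed.
titanic = pd.DataFrame({
    "name": ["Adams, Mr. Robert", "Baker, Mrs. Ann",
             "Clark, Miss. Jane", "Evans, Mr. Robert"],
})

# Plain text: names that contain "Robert".
roberts = titanic[titanic["name"].str.contains("Robert")]

# Regular expression: names that contain "Mrs." or "Miss."
mrs_or_miss = titanic[titanic["name"].str.contains(r"Mrs\.|Miss\.")]

print(roberts)
print(mrs_or_miss)
```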

### f) Number of rows that meet your conditions
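
Three equivalent ways to count the rows that pass a filter, sketched on invented data (the fare threshold of 10 is arbitrary):

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis", "Evans"],
    "fare": [80.0, 26.0, 7.75, 52.0, 8.05],
})

cheap = titanic["fare"] < 10

n_with_len = len(titanic[cheap])          # length of the filtered data frame
n_with_shape = titanic[cheap].shape[0]    # first element of (rows, columns)
n_with_sum = cheap.sum()                  # each True counts as 1

print(n_with_len, n_with_shape, n_with_sum)
```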


### g) Combine multiple filters

Note: Combine with __&__ (this means AND) or __|__ (this means OR)
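
A sketch of combining two filters on invented data. When the conditions are written inline, each one needs its own parentheses.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [34.0, 28.0, 41.0, 19.0],
})

female = titanic["sex"] == "female"
under_30 = titanic["age"] < 30

both = titanic[female & under_30]      # AND: rows where both are True
either = titanic[female | under_30]    # OR: rows where at least one is True

# The same AND filter written inline, with parentheses around each condition.
inline = titanic[(titanic["sex"] == "female") & (titanic["age"] < 30)]

print(both)
print(either)
```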

### h) Sort Data

Numeric values can be sorted in either ascending (lowest to highest) or descending (highest to lowest) order. To sort a data frame by the values in a particular column, use the following syntax:

        dataframename.sort_values(by=['column'])
        
Note: The default setting is to sort from lowest to highest. To switch to ordering highest to lowest, add the ascending=False argument.
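
A minimal sketch of sorting in both directions on invented data:

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed.
titanic = pd.DataFrame({
    "name": ["Adams", "Baker", "Clark", "Davis"],
    "fare": [80.0, 26.0, 7.75, 52.0],
})

lowest_first = titanic.sort_values(by=["fare"])                     # default: ascending
highest_first = titanic.sort_values(by=["fare"], ascending=False)   # descending

print(lowest_first)
print(highest_first)
```

`sort_values` returns a new, sorted data frame; the original `titanic` keeps its row order unless you assign the result back to it.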

## Exercise

1. Create a filter that lists passengers who did not survive
2. Combine the filters we created earlier in Section 6e) to create a list of passengers with the name Robert who survived
3. Create a filter that lists passengers in class 1 who were more than 30 years old
4. How many passengers fit the criteria from question 3?
5. Create a filter to search for passengers with the following honorific titles in their names: Sir, Lady, Jonkheer.