# FUNDAMENTALS OF DATA ANALYSIS WITH PYTHON <br><font color="crimson">DAY 5: DATA CHALLENGES AND 1:1 RESEARCH CONSULTATIONS</font>

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

## Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr> 

## <i class="fa fa-map-o"></i> PLAN FOR THE DAY

We will run our final day together (😲) a bit differently than we have done things to date. First of all, we have prepared a series of challenges that will put your new knowledge and skills to work. From here on out, it's all practice, practice, practice! And, as we have emphasized throughout the week, we strongly encourage you to adopt a **problem-based learning** mindset. To that end, we encourage you to use real data from your own research projects all day today. 

In addition to these challenges, Jillian and I will make ourselves available for 1:1 "consultations," for lack of a better word. We are happy to review material, answer questions, walk through some analyses of your data, help you problem-solve and navigate the documentation for packages that are relevant to your work, and so on. We are happy to provide whatever help is most beneficial to you. Most of all, we hope that you will walk away from the day (actually the whole week!) having used these new tools to make some progress on problems related to your own research. So if you have access to your data (in any file format), bring it! 

### Course Evaluations 

Angelika and / or Kathrin from GESIS will be here in the room at 11:00 am to administer course evaluations over the Ilias system. 

### Fixed Time for Lunch Today 

We will be breaking for **lunch at 1:00 pm**, returning at 2:30 pm. 

### <i class="fa fa-camera-retro"></i> Optional Group Picture 

We would like to take a quick group picture at 12:30 pm. We hope you will join, but if you would rather not that's fine too. 

# <font color="crimson">DATA CHALLENGES</font>

The difficulty of the data challenges below varies, but we have not indicated *how* difficult we expect them to be, because that will vary from person to person. We encourage you to complete as many as you can, but you should feel free to focus primarily on the ones that are the most relevant to your research interests. For example, if you don't expect to work with web scrapers, but you do expect to work with APIs, focus on the API challenges. If you don't expect to work with either, but you do work extensively with survey data, focus on importing your data and performing common data operations with `pandas`, `matplotlib`, etc. Finally, we encourage you to work as much as you can at the edge of your comfort zone, but don't spend all day doing that. Mix challenges that push you hard with challenges that reward you with some easy wins. 👏👏👏 

Most of the challenges below can be completed using only the knowledge and skills you acquired this week. Others require that you use the meta-skill of problem-solving by searching and reading documentation. We have indicated the latter type of challenge with the <font color="crimson"><i class="fa fa-search"></i></font> icon. 

Consult the course materials, documentation, Google, and StackOverflow as much as you like / need! But try not to rely too much on copy and paste. Even if you are going to use the *exact* same code, it is better to type it out yourself. You will learn more this way. 

## This is a buffet, not a hot dog eating contest 🤢

We do **not** expect anyone to do all of these challenges, and we do not really want you to try to do them all. We would prefer that you select the ones that are the most relevant for you. We recommend that you do not skip challenges that are relevant but at the edge of your comfort zone. Feel free to skip challenges that are less relevant to your research regardless of how challenging you might find them, but consider doing *some* less relevant challenges because they will develop your general competencies doing data analytic programming in Python. 

## Your Own Data, Your Own Questions 

As previously mentioned, we **strongly** encourage you to work with your own data here. If that is not possible, use a new dataset that we have not used very much (or at all) this week. 

Similarly, some of the challenges below are open-ended and occasionally a bit vague. That's because we want you to tailor them to *your own interests*. If we say something like "perform some operation on a `series` and assign the results to a new `series`", you should interpret this open-ended challenge as an invitation to work on your own data. We are not trying to be cryptic. 

Don't waste time speculating about the finer details of what we are looking for. What we want is to challenge you, help you practice, help you consolidate your new knowledge and skills, and for you to have something tangible that **you did** this week. 

## Data We Can Provide: Waves of the European Values Survey (EVS)

In addition to the Twitter data we have been working with all week, we have the **European Values Survey** available for you to use if you like. We are not putting it in the GitHub repo, so you will need to get it from one of us via a USB key. We have data from **1981**, **1990**, **1999**, **2008**, and **2017**. You are welcome to use any of this data if you like. 

The data, which was provided by GESIS, is available in the proprietary Stata and SPSS file formats. Python can read these files (in fact, that is one of the challenges). 

## Outline & Quick Access Links

* [Python 101](#101)
    * Data structures
    * Conditional execution 
    * Iteration 
    * <font color="crimson"><i class="fa fa-search"></i></font>  Error handling
    * Designing and developing functions 
* [Collecting Data from the Web](#collection)
    * Web Scraping 
    * APIs 
    * <font color="crimson"><i class="fa fa-search"></i></font>  Saving collected data
* [Importing and Inspecting Datasets](#import)
    * CSV files 
    * Importing data from proprietary file types 
        * <font color="crimson"><i class="fa fa-search"></i></font> Stata, Excel, SPSS, etc.
* [Data Management and Manipulation with Pandas](#MANAGEMENT)
    * Selecting `Series` (variables / columns) 
    * Filtering observations 
    * Aggregation and Grouped Operations 
    * Working with `Datetime` objects 
    * Creating new `series`
    * <font color="crimson"><i class="fa fa-search"></i></font> Processing survey data collected in waves 
    * <font color="crimson"><i class="fa fa-search"></i></font> Linking survey data with digital behavioural data
* [Statistics](#STATISTICS)
    * Measures of central tendency and dispersion 
    * Correlation and covariance 
    * <font color="crimson"><i class="fa fa-search"></i></font> Contingency tables and Chi-square tests
    * <font color="crimson"><i class="fa fa-search"></i></font> Write data analysis results to a file (e.g. LaTeX, markdown, or HTML tables)
    * <font color="crimson"><i class="fa fa-search"></i></font> ANOVA, linear regression, multiple regression, logistic regression, etc.   
* [Writing CSV files](#tocsv)
    * Write to a subdirectory 
    * <font color="crimson"><i class="fa fa-search"></i></font> Optional parameters
* [Data Visualization](#VIZ)
    * Counts, Distributions 
    * Relationships
    * Small multiples and dashboard-style visualizations 
    * Time series plots 
    * <font color="crimson"><i class="fa fa-search"></i></font> Pair plots 
    * <font color="crimson"><i class="fa fa-search"></i></font> Heatmaps 
    * <font color="crimson"><i class="fa fa-search"></i></font> Dendrograms 
* [Text Processing 101](#TEXT)
    * Getting text into `spaCy` 
    * Counting word frequencies 
    * Removing stop words and punctuation 
    * Selecting tokens by their part-of-speech 
    * Finding named entities 
    * Entity co-occurrence networks 
    * Difference of proportions analysis 
* [Exporting Jupyter Notebooks to HTML, PDF, etc.](#export)

# <font color="crimson">PYTHON 101 <a id='101'></a></font>


## Data Structures

* Define an `x` and `y` variable, assigning an integer value to both. Create a third variable `z` which is the sum of `x` and `y`. 

* Create 2 lists, each of length 5. Merge the two lists to get a single list of length 10. Use slicing to access the 3rd through to (and including) the 7th element in that list. 

* Create a dictionary to hold information about a movie (director, release year, etc). Without creating a new dictionary, add 1 more key-value pair to the dictionary. 
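
If the slicing or dictionary challenges feel slippery, here is a minimal sketch with placeholder values (the list contents and movie details are only illustrations, not course data). The key things to notice are that Python counts from zero and that the end of a slice is exclusive.

```python
# Two placeholder lists, each of length 5
first = [1, 2, 3, 4, 5]
second = [6, 7, 8, 9, 10]

# Concatenating with + returns a new list of length 10
merged = first + second

# The 3rd through 7th elements sit at zero-based positions 2 to 6.
# The end of a slice is exclusive, so we write 2:7.
middle = merged[2:7]
print(middle)  # [3, 4, 5, 6, 7]

# A dictionary describing a movie (illustrative values)
movie = {'title': 'Metropolis', 'director': 'Fritz Lang', 'release_year': 1927}

# Assigning to a new key adds a key-value pair to the existing dictionary
movie['country'] = 'Germany'
print(movie)
```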

## Conditionals

* Write a conditional expression that has at least one `if`, `elif`, and `else` condition. 

* Write a conditional expression (if statement) using `!=`.

* Combine multiple conditions using `and` or `or` operations in an `if` statement. 

## Iteration

* In the cell below write a `for` loop that iterates over a list, performs some transformation to each value, and stores the resulting values in a new list. 

* Write a `while` loop to iterate over a list or string until some condition is met. 

* Use a list comprehension to filter a list according to some condition.

* Create two iterables (lists, Series, etc) of the same length. Use the `zip()` function to iterate over the lists in a pair-wise fashion. Transform or `print` the paired values at each step. 
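
As a reminder of the syntax, the sketch below filters a list with a comprehension and then walks over two lists in parallel with `zip()`. The names `word_counts` and `section_names` and their values are made up for illustration.

```python
# Placeholder data: two lists of the same length
section_names = ['intro', 'methods', 'results', 'discussion']
word_counts = [120, 45, 300, 80]

# List comprehension: keep only the values that meet a condition
long_sections = [n for n in word_counts if n >= 100]
print(long_sections)  # [120, 300]

# zip() pairs up items by position; it stops at the end of the shorter iterable
labelled = []
for name, n in zip(section_names, word_counts):
    labelled.append(f'{name}: {n} words')

print(labelled[0])  # intro: 120 words
```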

## Errors
* Examine the errors produced in the 3 cells below. Explain what the errors mean. If needed, refer to the Python documentation or Stack Overflow for help.

```python
l = [1, 2, 3]
l[3]
```

```python
d = {'name': 'Karl', 
     'birth_year': 1975, 
     'birth_country': 'United Kingdom'}

d['birth-year']
```

```python
x = '5'
y = 10

x + y
```

* Write a loop which you know will produce an error (like the ones above). Implement a `try` `except` block to gracefully deal with these errors. 
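
One possible shape for this, sketched with a different error than the ones above so as not to spoil them. The list `raw_values` is a made-up stand-in for messy input; the pattern is to catch the *specific* exception you expect and keep the loop going.

```python
# Placeholder input: one value cannot be converted to an integer
raw_values = ['12', '7', 'unknown', '30']

parsed = []
skipped = []
for value in raw_values:
    try:
        parsed.append(int(value))
    except ValueError:
        # int() raises ValueError for strings that are not whole numbers.
        # Record the problem value and carry on with the next one.
        skipped.append(value)

print(parsed)   # [12, 7, 30]
print(skipped)  # ['unknown']
```

Catching `ValueError` rather than using a bare `except:` means that unexpected errors still surface instead of being silently swallowed.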

## Designing & Developing Functions
* Read the code in the cell below. Take time to understand the purpose of the function. Fill in the doc string with your explanation of the function. Try to remember the 3 components all doc strings should contain. 

```python
def simple_text_prep(strings):
    """
    # YOUR ANSWER HERE
    """
    prepped = []
    for s in strings:
        low = s.lower()
        no_ws = low.strip()
        words = no_ws.split(' ')
        prepped.append(words)
    
    return prepped


titles = ['\nStrategies for Reflexive Ethnography in the Smart Home: Autoethnography of Silence and Emotion  ', 
          '\t\tThe Methodological Divide of Sociology: Evidence from Two Decades of Journal Publications', 
          '  Popular but Peripheral: The Ambivalent Status of Sociology Education in Schools in England \n', 
          'Exploring Women’s Mutuality in Confronting Care-Precarity: ‘Care Accounts’ – a Conceptual Tool']

titles_prepped = simple_text_prep(titles)
```

* Write a function that takes in some value, applies a transformation, and returns the transformed value. 

* Write a function that makes use of an optional (aka default) parameter. 
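
A minimal sketch of a function with a default parameter. The function name `normalize_whitespace` and its behaviour are our own illustration; your function should do something useful for *your* data.

```python
def normalize_whitespace(text, lowercase=True):
    """Collapse runs of whitespace in a string into single spaces.

    Parameters
    ----------
    text : str
        The string to clean.
    lowercase : bool, optional
        If True (the default), also convert the result to lowercase.

    Returns
    -------
    str
        The cleaned string.
    """
    # split() with no arguments splits on any run of whitespace
    cleaned = ' '.join(text.split())
    if lowercase:
        cleaned = cleaned.lower()
    return cleaned


print(normalize_whitespace('  Hello \t World\n'))                   # hello world
print(normalize_whitespace('  Hello \t World\n', lowercase=False))  # Hello World
```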

# <font color="crimson">COLLECTING DATA FROM THE WEB<a id='collection'></a></font>

## Web Scraping

* Identify a website you want to scrape. Explore the source code on the website to discover patterns and determine where the data of interest to you is being stored. Implement a scraper to access and retrieve the data of interest to you.


* Implement a programmatic way to scrape data from multiple URLs, keeping the idea of polite web scraping in mind. 

## APIs

* Find an API that is of interest to you. Explore the documentation and use the `requests` library to successfully access the API (200 Response). Determine what format is used to structure the `content` attribute of the response.

* Implement a programmatic way to make multiple API calls, keeping rate limits and the idea of polite web-scraping in mind. 
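
For both the scraping and the API versions of this challenge, the core pattern is a loop with a pause between requests. The sketch below assumes a hypothetical helper, `collect_politely`, that accepts any `fetch` function; we pass in a fake one here so the example runs offline.

```python
import time


def collect_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request after the first one
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    """Stand-in for a real request so that the sketch runs without a network."""
    return f'<html>content of {url}</html>'


pages = collect_politely(
    ['https://example.com/a', 'https://example.com/b'],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))  # 2
```

In practice, `fetch` might be a small function that wraps `requests.get(url)` and checks the status code, and the delay should respect the site's terms of use or the API's documented rate limit.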

## <font color="crimson"><i class="fa fa-search"></i></font> Save Collected Data
Using previously written code, collect data from the web. Once you've collected this data, write the results to a file that can be accessed later.
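
One option among several is JSON, which maps naturally onto lists and dictionaries. The sketch below uses made-up `records` and hypothetical helper names (`save_records`, `load_records`); it writes into a temporary directory, whereas in a real project you would probably point it at something like a `data` subdirectory.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, path):
    """Write a list of dictionaries to a JSON file, creating folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back from disk."""
    with Path(path).open(encoding='utf-8') as f:
        return json.load(f)


# Placeholder for data you collected earlier
records = [{'id': 1, 'text': 'first item'}, {'id': 2, 'text': 'second item'}]

# A temporary directory keeps this sketch tidy
out_path = Path(tempfile.mkdtemp()) / 'collected' / 'records.json'
save_records(records, out_path)

reloaded = load_records(out_path)
print(reloaded == records)  # True
```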

# <font color="crimson">IMPORTING AND INSPECTING DATASETS<a id='import'></a></font>

## CSV Files

* Import a CSV file using `pandas`. 

* Import a CSV file using `pandas` with an optional parameter that specifies the delimiter (e.g. comma, tab) that marks the boundaries between columns in the dataset. 
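
As a small illustration, `pd.read_csv` takes a `sep` argument for the delimiter. To keep the sketch self-contained we read from an in-memory string (`tsv_text`, made up for this example) rather than from a file on disk; with a real file you would pass the file path instead.

```python
import io

import pandas as pd

# Stand-in for a tab-separated file on disk
tsv_text = 'respondent\tage\tcountry\n1\t34\tDE\n2\t51\tCA\n'

# sep tells pandas which character separates the columns
df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df.shape)          # (2, 3)
print(list(df.columns))  # ['respondent', 'age', 'country']
```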

## Reading Data from Proprietary File Types

* Import an SPSS data file using `pandas`. If you do not have your own, we can provide files from the EVS.

* Import a Stata data file using `pandas`. If you do not have your own, we can provide files from the EVS.

# <font color="crimson">DATA MANAGEMENT AND MANIPULATION<a id='MANAGEMENT'></a></font>

## Selecting `Series` (variables / columns) 

* Select a single `series` from your `dataframe` and convert it to a list. Give it a new name. 

* Create a subset of your `dataframe` that consists of a subset of columns (>= 2) that you will (or might) use in an analysis. Give it a new name. 

## Filtering observations 



* Create a subset of your `dataframe` that filters observations (rows) based on values for a quantitative variable. 

* Create a subset of your `dataframe` that filters observations based on values for a categorical variable.

## Aggregation and Grouped Operations 

* Create a `grouped` object from your dataframe based on one categorical variable. Compute some descriptive statistics for each group.

* Create a `grouped` object from your dataframe based on **two or more** categorical variables. Compute some descriptive statistics for each group.
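
A minimal sketch of both grouped operations on a toy `dataframe`. The column names (`country`, `gender`, `trust`) and values are invented for illustration.

```python
import pandas as pd

# Toy data with two categorical variables and one quantitative variable
df = pd.DataFrame({
    'country': ['DE', 'DE', 'CA', 'CA', 'CA'],
    'gender': ['f', 'm', 'f', 'f', 'm'],
    'trust': [6, 4, 7, 9, 5],
})

# Group on one categorical variable, then compute several statistics at once
by_country = df.groupby('country')['trust'].agg(['mean', 'count'])
print(by_country)

# Group on two categorical variables by passing a list of column names
by_both = df.groupby(['country', 'gender'])['trust'].mean()
print(by_both)
```

Grouping on two variables returns a result with a two-level index, so a single group can be looked up with a tuple such as `by_both.loc[('CA', 'f')]`.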

## Working with `Datetime` objects 



* If your data has a variable with date or time information, convert it to a `datetime` object. If possible (it depends on the specifics of your data), use the `resample` method to upsample or downsample your data. Compute the number of observations in each date / time group. 

* Compute some descriptive statistics for date / time groups in your data. Compute changes in those statistics from one date / time group to the one that follows. Feel free to resample up or down to a level that makes sense given your research interests. 
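
The sketch below strings these steps together on invented data (the `created_at` and `likes` columns are placeholders). It converts strings to `datetime`, downsamples to daily groups, and uses `diff()` to get the change from one day to the next.

```python
import pandas as pd

# Toy data: timestamps stored as strings, plus one quantitative variable
df = pd.DataFrame({
    'created_at': ['2020-03-02 09:15', '2020-03-02 17:40',
                   '2020-03-03 08:05', '2020-03-04 12:30',
                   '2020-03-04 13:00', '2020-03-04 21:45'],
    'likes': [3, 5, 2, 8, 1, 6],
})

# Convert to datetime and use it as the index, which resample() needs
df['created_at'] = pd.to_datetime(df['created_at'])
df = df.set_index('created_at')

# Number of observations per day
daily_counts = df.resample('D').size()
print(daily_counts.tolist())  # [2, 1, 3]

# Mean per day, and the change from each day to the next
daily_mean_likes = df['likes'].resample('D').mean()
daily_change = daily_mean_likes.diff()
print(daily_change)
```

The first value of `daily_change` is `NaN` because the first day has no preceding day to compare against.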

## Creating new `series`

* Perform some sort of operation on a `series` in your `dataframe`. Assign the resulting values to a new `series`. 

* Perform some sort of operation on **multiple** `series` in your `dataframe`. Assign the resulting values to a new `series`.

## <font color="crimson"><i class="fa fa-search"></i></font> Processing survey data collected in waves 



* Import multiple waves of your survey data (or the EVS) and assign them informative descriptive names (e.g. `South_Korea_Time_Use_2015`). 

* If necessary, create a new variable for each wave that stores date, time, and other important metadata. Then select the subset of columns you intend to analyze and combine the `dataframes` for each wave into a single `dataframe`. 

* Compute some statistics for each wave and examine how they change over time. 
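
One way to approach the combining step is to tag each wave with its year and stack the results using `pd.concat`. The two tiny `dataframes` below are invented stand-ins, **not** real EVS values, and the column names are hypothetical.

```python
import pandas as pd

# Stand-ins for two imported waves (made-up values)
wave_2008 = pd.DataFrame({'respondent': [1, 2, 3], 'trust': [5, 6, 4]})
wave_2017 = pd.DataFrame({'respondent': [1, 2, 3], 'trust': [6, 7, 5]})

waves = {2008: wave_2008, 2017: wave_2017}

labelled = []
for year, wave in waves.items():
    # Keep only the columns of interest and record which wave each row came from
    labelled.append(wave[['respondent', 'trust']].assign(wave=year))

# Stack the waves into a single dataframe
combined = pd.concat(labelled, ignore_index=True)

# Compare a statistic across waves
trust_by_wave = combined.groupby('wave')['trust'].mean()
print(trust_by_wave)
```

With real survey waves, variable names often differ from wave to wave, so expect to rename columns before this step.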

## <font color="crimson"><i class="fa fa-search"></i></font> Linking survey data with other datasets

* Link your survey data with another dataset. Some potential options include: social media data (if handles were provided by participants), data on values of a relevant categorical variable (e.g. population of a city), and so on. 

# <font color="crimson">STATISTICS<a id='STATISTICS'></a></font>

## Measures of central tendency and dispersion 

* Compute measures of central tendency and dispersion for quantitative `series` / variables in your `dataframe`.

## Correlation and covariance 

* Compute the correlation between two `series` / variables in your `dataframe`.

* Compute a correlation matrix for the quantitative variables in your `dataframe`.

* Compute the covariance for two quantitative `series` / variables in your `dataframe`. 

## <font color="crimson"><i class="fa fa-search"></i></font> Contingency tables and Chi-square tests

* Compute a contingency table for categorical variables in your `dataframe`. Perform a Chi-square test of independence. 
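
If you are unsure where to start searching, one common route (an assumption on our part, there are others) is `pd.crosstab` for the table and `chi2_contingency` from `scipy.stats` for the test. The data below is made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: two categorical variables, 40 observations
df = pd.DataFrame({
    'region': ['north'] * 20 + ['south'] * 20,
    'votes': ['yes'] * 15 + ['no'] * 5 + ['yes'] * 5 + ['no'] * 15,
})

# Contingency table of counts
table = pd.crosstab(df['region'], df['votes'])
print(table)

# Chi-square test of independence.
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity
# correction by default (correction=True).
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)
```

The `expected` array holds the counts you would expect under independence, which is worth inspecting before trusting the test on small samples.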

## <font color="crimson"><i class="fa fa-search"></i></font> Write data analysis results to a file (e.g. LaTeX, markdown, or HTML tables)

* Use the [`tabulate`](https://pypi.org/project/tabulate/) package to write a table to an external file. Use a format that makes sense for how you write (e.g. a LaTeX file, a markdown file, an HTML file). Use optional parameters to modify how the content is written. 

## <font color="crimson"><i class="fa fa-search"></i></font> ANOVA, linear regression, multiple regression, logistic regression 

* Use Python to estimate a statistical model (however simple or complex) that you might actually use in your research. It doesn't have to be the best model you are able to develop, as the point is to learn how to develop and estimate a model using Python. A good place to start for help is the [`statsmodels`](https://www.statsmodels.org/stable/index.html) package. You can consult their documentation, including a set of [worked examples](https://www.statsmodels.org/stable/index.html). Of course, you can also ask one of us for help. 

* Use Python to create plots for regression diagnostics. 

* Use Python to create a coefficient plot, if that's your thing. 😀

# <font color="crimson">WRITING CSV FILES<a id='tocsv'></a></font>

## Write to a subdirectory 

* Use `pandas` to write your `dataframe` to disk. At minimum, put this file in a subdirectory called `output` or `data`. **Do not overwrite the original dataset you imported!**

## Use optional parameters 

* Use optional parameters in the function you used above to (1) specify the delimiter separating columns and (2) prevent `pandas` from writing the `index` to the CSV file. 
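
A minimal sketch of writing to a subdirectory with both optional parameters. The `dataframe`, the file name `trust_subset.tsv`, and the use of a temporary folder as the "project" directory are all placeholders for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'respondent': [1, 2], 'trust': [5, 6]})

# Stand-in for your project folder; in practice use your working directory
project_dir = Path(tempfile.mkdtemp())

# Create the output subdirectory if it does not exist yet
output_dir = project_dir / 'output'
output_dir.mkdir(parents=True, exist_ok=True)

# sep sets the delimiter; index=False leaves the row index out of the file
out_file = output_dir / 'trust_subset.tsv'
df.to_csv(out_file, sep='\t', index=False)

# Peek at the header row to confirm the result
first_line = out_file.read_text(encoding='utf-8').splitlines()[0]
print(first_line)
```

Writing under a new file name in a separate folder is what protects the original dataset from being overwritten.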

# <font color="crimson">DATA VISUALIZATION<a id='VIZ'></a></font>

For all visualization challenges below, focus on developing graphs that are as clear and informative as possible. Try to follow practices based on the realities of human perception and the vision system as much as possible (see the preamble to the viz notebook). Aesthetic preferences are also perfectly valid considerations, but they come second. Customize the look of your graphs as much as you like, but do not compromise on clarity. Remember, in general, less is more. Simplify things, but not at the expense of showing what you really need to show. 

For all graphs below, call `plt.show()` to see the graphs in this notebook, but *also* save high resolution images in an `img` subdirectory as vector graphics (PDF for print, SVG for web) or high-resolution raster graphics (PNG, JPEG).

## Counts, Ranks, Distributions 

* Produce a horizontal bar graph or a Cleveland dot plot to show count data or ranks for some `series` or variable in your `dataframe`.

* Produce a histogram to visualize the distribution of some quantitative `series` / variable. Modify the number of bins. 

* Produce a kernel density estimate to visualize the distribution of some quantitative `series` / variable. 

## Relationships

* Produce a `scatterplot` of the relationship between two variables. 

* <font color="crimson"><i class="fa fa-search"></i></font> Use [`seaborn`](https://seaborn.pydata.org/examples/index.html) to create a `scatterplot` with marginal histograms. 

* <font color="crimson"><i class="fa fa-search"></i></font> Use [`seaborn`](https://seaborn.pydata.org/examples/index.html) to create a `scatterplot` with a linear regression line. 

* <font color="crimson"><i class="fa fa-search"></i></font> Use [`seaborn`](https://seaborn.pydata.org/examples/index.html) to create a `scatterplot` with a nonparametric lowess model. Note: `seaborn` will take care of the model estimation itself using the `statsmodels` package.  

## Small multiples and dashboard-style visualizations 

* Create a "small multiples" graph with 4 subplots. The graphs can be of the same type, or they can differ (i.e. like a "dashboard"). 

* Create a "small multiples" graph with 3 subplots. The graphs can be of the same type, or they can differ (i.e. like a "dashboard"). 

## Time series plots 

* Produce a time series plot. If you do this challenge, you should also do the `Datetime` object challenge above. Use the same data to produce this graph. 

## <font color="crimson"><i class="fa fa-search"></i></font> Pair plots 

* Use `seaborn` to produce a paired density and scatterplot matrix. You can follow the example from the [documentation](https://seaborn.pydata.org/examples/pair_grid_with_kde.html). 

## <font color="crimson"><i class="fa fa-search"></i></font> Heatmaps 

* Produce a correlation matrix of a subset of your `dataframe`. Visualize that correlation matrix as a heatmap using  `matplotlib`, `seaborn`, or both. 

## <font color="crimson"><i class="fa fa-search"></i></font> Dendrograms 

* Perform a hierarchical clustering analysis and visualize the results as a dendrogram. Cut the dendrogram at a defensible place and store the cluster labels as a new `series`. 

# <font color="crimson">TEXT PROCESSING 101<a id='TEXT'></a></font>

## Getting text into `spaCy` 

* Load a language model into `spaCy`. (We downloaded the small English core model trained on web data. You can download other models, including for other languages. There is a German model.) Then pass your data through the `nlp` pipeline as strings (you will need to iterate) or as a list using `nlp.pipe()`. Print the `doc` object. 

## Counting word frequencies 

* Access the `tokens` in your `spaCy doc` or `docs` to count how often each token appears in a `doc`, or across `docs`. Go a bit further by creating a dot plot of the 25 most common *words*. Consider restricting them to a specific part-of-speech (e.g. nouns).

## Removing stop words and punctuation 

* Create a `list` of `tokens` that does *not* include stop words or punctuation. Do this for each `doc` object you create. 

## Noun chunks

* Create a list of noun chunks for each `doc` object you create. 

## Selecting tokens by their part-of-speech 

* Create a list of the nouns (or some other part-of-speech) in each of the `doc` objects you created. 

## Finding named entities 

* Count the number of named entities `spaCy` identifies in each general category (e.g. GPE, NORP, PERSON). 

## Entity co-occurrence networks 

* Create a network where entities are linked when they co-occur in a sentence, paragraph, or some other meaningful unit of text. 
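
The counting logic behind a co-occurrence network can be sketched without any network package at all. Below, `sentence_entities` is a hypothetical stand-in for the entity lists you would extract from each sentence with `spaCy`; the result is a weighted edge list.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entities found in each sentence (one list per sentence)
sentence_entities = [
    ['Cologne', 'GESIS', 'Germany'],
    ['GESIS', 'Germany'],
    ['Waterloo', 'Canada'],
]

edge_weights = Counter()
for entities in sentence_entities:
    # De-duplicate and sort so each pair is always stored in the same order
    unique = sorted(set(entities))
    # Every pair of entities in the same sentence counts as one co-occurrence
    for pair in combinations(unique, 2):
        edge_weights[pair] += 1

print(edge_weights.most_common(1))  # [(('GESIS', 'Germany'), 2)]
```

From here, the pairs and their counts could be handed to a network analysis package as weighted edges.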

## Difference of proportions analysis

* Use the `" ".join()` method to create two strings (representing a category of text of some kind) that can be compared using a simple difference of proportions analysis. Compare how often certain words are associated with each category. Go further by visualizing the top words from each category with a dot plot or horizontal bar graph. 
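
Here is one simple way to read "difference of proportions", sketched on made-up documents; the version in the course notebook may differ in its details. For each word, we compute its share of all tokens in each category and subtract one share from the other. Tokenization here is a plain `split()` to keep the example dependency-free.

```python
from collections import Counter


def word_proportions(text):
    """Return each word's share of all tokens in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


# Made-up documents for two categories
group_a_docs = ['data data survey', 'survey data']
group_b_docs = ['network network data', 'network graph']

# Join each category's documents into one long string
text_a = ' '.join(group_a_docs)
text_b = ' '.join(group_b_docs)

props_a = word_proportions(text_a)
props_b = word_proportions(text_b)

# Positive values lean towards category A, negative towards category B
vocabulary = set(props_a) | set(props_b)
differences = {w: props_a.get(w, 0) - props_b.get(w, 0) for w in vocabulary}

ranked = sorted(differences.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The two ends of `ranked` are the natural candidates for the dot plot or bar graph.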

# <font color="crimson">Exporting Jupyter Notebooks to HTML, PDF, etc.<a id='export'></a></font>

* From the file menu above or the command line / terminal, export this notebook as (1) an HTML file and (2) as a PDF file. Open them to verify the export worked. 

# `</course>`

Congratulations! 👏

Thanks for a great week! It's been a pleasure getting to know you and to learn about your research projects! Shoot us a message if you ever find yourself in Toronto, Waterloo, St. John's, or Vancouver, or if you are going to a conference you think one of us might also be attending. 😀

John (<john.mclevey@uwaterloo.ca>) and Jillian (<jillianderson8@gmail.com>)