# SI 618: Data Manipulation and Analysis
## 01 - Introduction
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

# Overview of Today
* Teaching team introductions
* Why this course?
* What you’ll learn
* Syllabus walk-through
* Introduction to Data Manipulation
* Introduction to Anaconda and Jupyter Notebooks

# About your instructor: Dr. Chris Teplovs
* Originally from Canada (and currently living there)
* Ph.D. in Curriculum, Teaching and Learning from the University of Toronto
* Postdoctoral Fellow at Copenhagen Business School
* Visiting Associate Research Professor, École Normale Supérieure de Cachan, France
* Lead Developer, Office of Academic Innovation
* Lecturer & Research Scientist, School of Information

# About the teaching team

* Julie Burke
* Scott Henry
* Scott Zabrowski
* Shiyan Yan

# Icebreaker: Two truths and a lie  
1. Think of three statements about yourself. Two must be true statements, and one must be false. 
1. Take turns: each person at your table shares their three statements (in any order) with the group.
1. The goal of the icebreaker game is to determine which statement is false. 


# Why this course?

* About 80% of initial work on a data science project involves **data manipulation**
 * accessing, converting, transforming, cleaning, filtering, aggregating, grouping, summarizing
* Data analysis is tightly coupled with data manipulation, especially when iterating

# Exploratory Data Analysis is Detective Work
* Tools to find clues, put together a story, identify suspects
* Sometimes clues are accidental or misleading…
 * but it’s critical to find them anyway…
   * because overlooking potential clues could be even worse!

# Goals of Exploratory Data Analysis
* Provide simplified descriptions of what’s happening in (possibly) complex data
* Look below the surface to get new insights
* See interesting behavior through visualization

# We want surprises!
* Exploration & visualization closely linked
* We need pictures with impact: the Goldilocks Principle (not too much, not too little, just right)
* Make comparisons easy


![S&P 500 Historical Index](assets/SP500.png)

# General approach to exploratory data analysis

* First, find or define "normal"
* Then, compute and visualize differences from the normal
* If studying an event or property in a dataset:
 * What frequency would you "normally" expect?
 * What was the actual frequency observed, compared to the "normal"?
* Pick visualizations that emphasize those differences
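
As a minimal sketch of this approach, assume a small made-up table of yearly counts (the names `df_events`, `year`, and `count`, and all values, are hypothetical and not course data). Here the mean is taken as "normal", which is only one possible choice:

```python
import pandas as pd

# Hypothetical yearly counts of some event (illustrative values only)
df_events = pd.DataFrame({
    "year": [2014, 2015, 2016, 2017, 2018],
    "count": [100, 110, 90, 200, 100],
})

# Step 1: define "normal"; here, simply the mean count across all years
normal = df_events["count"].mean()

# Step 2: compute each year's difference from that baseline
df_events["diff_from_normal"] = df_events["count"] - normal

# Step 3: find the year that deviates most from the baseline
most_unusual = df_events.loc[df_events["diff_from_normal"].abs().idxmax()]
print(most_unusual)
```

A median or a rolling average could serve as the baseline instead; the mean is sensitive to exactly the kind of outlier we are trying to detect.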

# Put another way:

* You have data -- possibly a lot of it -- in an input format
* You need to manipulate (transform, filter, sort, etc.) it
* You need to visualize your data
* You might need to perform some numerical or other analysis on the data
* You might need to store the results

# Our basic abstraction: the table

![table description](assets/table.png)

## Table operations
* Inserting, updating, and deleting records (a.k.a. CRUD: Create, Retrieve, Update, Delete)
* Selecting records (row subsets)
* Filtering records (row subsets)
* Sorting records
* Slicing records (column subsets)
* Chaining these together
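
The operations above might look roughly like this in pandas, which we cover in more detail next time. This is a sketch on a made-up table; the DataFrame `df` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical table: each row is a record, each column is a field
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev"],
    "gender": ["F", "M", "F", "M"],
    "birth_count": [120, 85, 40, 230],
})

# Filtering: keep only rows that satisfy a condition (row subset)
df_female = df[df["gender"] == "F"]

# Sorting: order records by a column, largest first
df_sorted = df.sort_values("birth_count", ascending=False)

# Slicing: keep only some columns (column subset)
df_cols = df[["name", "birth_count"]]

# Chaining: filter, then sort, then slice in one expression
result = (
    df[df["birth_count"] > 50]
    .sort_values("birth_count", ascending=False)
    .loc[:, ["name", "birth_count"]]
)
print(result)
```

Each step returns a new DataFrame and leaves `df` unchanged, which is what makes chaining possible.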

# Skills you'll learn in this course

* How to get / read / gather / fetch / crawl data
* How to convert data to and from important formats
* Basic computation and manipulation of the data, including filtering and sorting
* Methods to explore and visualize to gain insights
* Applied statistical methods
* Machine learning methods
* Approaches to handling Big Data

# Tools you'll learn in this course

* Python core functionality
* Jupyter notebooks
* Python packages
 * pandas, 
 * matplotlib, seaborn, 
 * scikit-learn, 
 * re, NLTK, gensim, 
 * pyspark

# Syllabus walk-through

[Canvas](https://umich.instructure.com/courses/264741)

[Syllabus](https://docs.google.com/document/d/15h38xlL3LPRPvMHGoz5EOG0HY3w_DqBD1ReyDPFYZyg/edit?usp=sharing)



# Class format

* meeting face-to-face once a week
* series of 12 in-class notebooks
* about 5-6 "segments" per class
* often (but not always): 1st half of class is interactive; 2nd half is traditional "lab"

# Late policy
* You have 3 penalty-free late days
* One late day = one 24-hour period after due date
* No fractional late days: all or nothing
* 25% penalty per late day after late days used up
* You don't need to explain late days
* We track them for you
* Submit late assignments via Canvas (like usual)

# Original work policy
Unless otherwise specified in an assignment, all submitted work must be your own, original work. Any excerpts, statements, or phrases from the work of others must be clearly identified as a quotation, and a proper citation provided. Any violation of the School’s policy on Academic and Professional Integrity (stated in the Master’s and Doctoral Student Handbooks) will result in serious penalties, which might range from failing an assignment, to failing a course, to being expelled from the program. Violations of academic and professional integrity will be reported to UMSI Student Affairs. Consequences impacting assignment or course grades are determined by the faculty instructor; additional sanctions may be imposed by the assistant dean for academic and student affairs.

# Accommodations for students with disabilities
If you think you need an accommodation for a disability, please let me know at your earliest convenience. Some aspects of this course, the assignments, the in-class activities, and the way we teach may be modified to facilitate your participation and progress. As soon as you make me aware of your needs, we can work with the Office of Services for Students with Disabilities (SSD) to help us determine appropriate accommodations. SSD (734-763-3000; ssd.umich.edu/) typically recommends accommodations through a Verified Individualized Services and Accommodations (VISA) form. I will treat any information that you provide in as confidential a manner as possible.

# Student mental health and wellbeing
The University of Michigan is committed to advancing the mental health and wellbeing of its students, while acknowledging that a variety of issues, such as strained relationships, increased anxiety, alcohol/drug problems, and depression, directly impact students' academic performance.
If you or someone you know is feeling overwhelmed, depressed, and/or in need of support, services are available. For help, contact Counseling and Psychological Services (CAPS) at (734) 764-8312 and https://caps.umich.edu/ during and after hours, on weekends and holidays or through its counselors physically located in schools on both North and Central Campus. You may also consult University Health Service (UHS) at (734) 764-8320 and https://www.uhs.umich.edu/mentalhealthsvcs, or for alcohol or drug concerns, see www.uhs.umich.edu/aodresources.

# Questions?

# Getting set up
* [Canvas](https://umich.instructure.com/courses/267556)
* [Slack](https://si618wn2019.slack.com)
* Jupyter ([Anaconda](https://www.anaconda.com))
* [GitHub](https://github.com/umsi-data-science/si618)
* [Kaggle](https://www.kaggle.com/)


## Canvas
* institutional learning management system
* you'll find assignments and grades here

## Slack

* group communication tool
* primary communication tool in this course (instead of email)

## Jupyter and JupyterLab
* What is Jupyter?
> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.


## Why Jupyter?
 * Interactive, reproducible results, literate programming, REPL (read-eval-print loop)
 * great for data exploration
 
## Why JupyterLab?
 * next-generation UI for Jupyter

## Jupyter and Python

* in the beginning: Python
* later: IPython
* still later: Jupyter notebooks
 * not just python (R, Julia, etc.)
* different from scripting
* great for data analysis


## What's wrong with Jupyter notebooks?
* not great for software engineering (see Joel Grus' ["I don't like notebooks"](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1) presentation)


In [3]:
from IPython.display import YouTubeVideo
# a talk by Joel Grus on why he doesn't like notebooks
YouTubeVideo('7jiPeIFXb6U')

## GitHub

* covered in SI 506 and SI 507
* [git installation instructions](https://www.atlassian.com/git/tutorials/install-git)
* [si618 repo](https://github.com/umsi-data-science/si618)
* nbviewer

## Binder
* create interactive notebooks from a git repo
 * technically, it uses [repo2docker](https://github.com/jupyter/repo2docker) to create Docker images from a git repo
* useful in case you can't run Jupyter on your laptop

## Kaggle
* online community for data scientists
* facilitates publishing data and "kernels"
* not only Python
* competitions
* most of the course datasets are from Kaggle
* there will be opportunities for you to publish your work on Kaggle

## Questions?

## Next steps
1. Take a 10-minute break
2. Follow the invitation link to Slack (see Canvas Announcements)
3. Install [Anaconda](https://www.anaconda.com/download/) (if you haven't already)
4. Install git (if you don't already have it) and clone the [si618 GitHub repo](https://github.com/umsi-data-science/si618)
5. Start JupyterLab, either from the command line in Terminal (Mac) or PowerShell (Windows) or using Anaconda Navigator.
6. Open today's notebook.
7. We'll work through the lab together, share some insights, submit your first notebook for the class, and talk about preparing for the next class

## Learning Objectives

* install and run JupyterLab
* ensure you can use needed libraries
* clone the si618 GitHub repo and make a working copy of the notebook(s) you'll be working on
* be able to run a class notebook
* write your first code in this class
* confirm that you've set up Anaconda, Slack, GitHub
* practice submitting an assignment

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = '?'

Let's answer a couple of questions about the course:

### <font color="magenta">Q1: (2 points) What are you looking forward to learning in this class?</font>

Insert your answer here.

### <font color="magenta">Q2: (2 points) What are you most concerned about in this class?</font>

Insert your answer here.

### <font color="magenta">Q3: (4 points) Setup confirmation</font>

Answer the following:

* Do you have access to the [Canvas](https://umich.instructure.com/courses/267556) site?
 
 
* Do you have access to the [Slack](https://si618wn2019.slack.com) site?
 
 
* Have you installed Jupyter ([Anaconda](https://www.anaconda.com))?  What version of Python are you using?  What version of JupyterLab are you using?
 

* Were you able to clone the [course repo](https://github.com/umsi-data-science/si618) on GitHub?  Do you have a GitHub account (we recommend that you get one)?  What's the URL to your GitHub user page?
 

* Did you set up a [Kaggle](https://www.kaggle.com/) account?  What's the URL to your Kaggle user account?
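
For the version questions above, one way to look them up from inside a notebook is sketched below. It assumes Python 3.8 or later (for `importlib.metadata`); `installed_version` is a helper name made up for this example, not part of any library:

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Version of the Python interpreter that is running this notebook
python_version = sys.version.split()[0]
print("Python:", python_version)


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "jupyterlab" and "pandas" are the package names as published on PyPI
for package in ("jupyterlab", "pandas"):
    print(package, installed_version(package))
```

If a package prints `None`, it is not installed in the environment the notebook is running in.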
 


First, let's load the pandas library (which we'll cover in more detail next time):

In [None]:
import pandas as pd

Execute the next two blocks to load the data and see what the DataFrame looks like. It includes the number of births for each name/gender combination for each year. The "df_" prefix indicates that a variable holds a DataFrame.

In [None]:
df_names = pd.read_csv('data/names.csv')

In [None]:
df_names.head()

We want to visualize the number of births of a specific baby name (e.g., "Mike") across the years. First, execute the following code to select birth records just for male babies named "Mike".

In [None]:
df_mike = df_names[(df_names.name == 'Mike') & (df_names.gender == 'M')]

In [None]:
df_mike.head()
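
To see what the selection above is doing, here is a minimal sketch on a made-up table shaped like the names data (the name `df_demo` and all of its values are hypothetical). Each comparison produces a boolean mask, and indexing with the mask keeps the rows where it is `True`:

```python
import pandas as pd

# Hypothetical rows shaped like the names data (values are made up)
df_demo = pd.DataFrame({
    "name": ["Mike", "Mike", "Anna", "Mike"],
    "gender": ["M", "F", "F", "M"],
    "year": [1990, 1990, 1990, 1991],
    "birth_count": [500, 3, 400, 480],
})

# Each comparison yields a boolean Series with one True/False per row
is_mike = df_demo["name"] == "Mike"
is_male = df_demo["gender"] == "M"

# & combines the two masks element-wise; when the comparisons are written
# inline, wrap each in parentheses because & binds more tightly than ==
mask = is_mike & is_male

# Indexing with the mask keeps only the rows where it is True
df_demo_mike = df_demo[mask]
print(df_demo_mike)
```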

To create a plot, execute the following code. The first line starting with "%" is a command that enables inline plotting for the matplotlib library in Jupyter Notebook. This line needs to be executed only once per session.

In [None]:
%matplotlib inline
df_mike.plot('year', 'birth_count', title='"Mike"')

### <font color="magenta">Q4: (12 points)</font>
Repeat the above steps to create a plot for another name of your choice. Pay attention to selecting an appropriate gender ("M" or "F").


In [None]:
# insert your code here

### <font color="magenta">Q5: (16 points)</font>
Complete the implementation of the function below.  Use the function to make plots for as many names as you like but create at least three. Then, identify at least three interesting names and explain, using complete sentences, why they're interesting.

In [None]:
def plot_trend(name, gender):
    # TODO: complete your function here
    #       so that it creates a plot for the specified name and gender.
    #       e.g., plot_trend("Mike", "M") should generate the same plot as above.
    #       You will need to replace the ```pass``` statement in the following line
    pass

In [None]:
# call the above function with different names and genders

Use this space to explain why these plots are interesting.

# <font color="magenta">END OF NOTEBOOK</font>
## Remember to submit this file in IPYNB and HTML format via Canvas.

# Preparing for next class

See Syllabus

1. Review Python basics: McKinney Chapters 1, 2, and 3
2. Study McKinney Chapters 4, 5, and 6, plus "Data Wrangling with Python & Pandas"