# Exploratory Data Analysis in Python

<hr>

## What we'll cover:

* [Are spreadsheets good enough?](#Are-spreadsheets-good-enough?)
* [When to automate?](#When-to-automate?)
* [Which language should I use?](#Which-language-should-I-use?)
* [What is Python and Python Setup](#What-is-Python-and-Python-Setup)
* [Where to get help?](#Where-to-get-help?)
* [Using Peer and Mob Coding Sessions](#Using-Peer-and-Mob-Coding-Sessions)
* [Using DevSecOps to stay on track](#Using-DevSecOps-to-stay-on-track)
* [Where do I save my code?](#Where-do-I-save-my-code?)
* [Demos!](#Demos!)

<hr>

## Are spreadsheets good enough?

<hr>

As mentioned in the previous section, do a quick assessment to ensure that an involved and possibly complex programmatic solution is necessary. There is nothing more frustrating than setting up environments and repos only to find out that a few clicks in a GUI application would have had the same effect. In the broader scope of time management, choose the most efficient method to get the answers you need from the data. If time permits, go back and attempt to replicate in Python what you've done in a spreadsheet, but not at the expense of pushing back deadlines or the contributions of other team members.

As you do more with Python, your collection of code snippets and gists will grow, making quick EDA tools as readily usable as a spreadsheet. Until then, be smart about tool selection.
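
As an illustration of the kind of reusable snippet described above, the sketch below summarizes one numeric column of a small CSV using only the standard library. The sample data, the column names, and the `summarize_column` helper are hypothetical and exist only for this example.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a small spreadsheet export.
SAMPLE_CSV = """region,sales
north,120
south,95
east,
west,143
"""


def summarize_column(csv_text, column):
    """Return count, mean, median, min, and max for one numeric column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Skip blank cells so they do not distort the statistics.
    values = [float(row[column]) for row in rows if row[column].strip()]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


summary = summarize_column(SAMPLE_CSV, "sales")
print(summary)
```

Once a helper like this lives in your snippet collection, pointing it at a new file is about as quick as opening a spreadsheet.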

<hr>

## When to automate?

<hr>

Opinions may vary on this, but in general the question of when to automate data analysis with Python comes down to knowing the limitations and performance of your available tooling. More specifically, you will have to know two initial things in the context of the data and the available analytic tools:

* the Shape of the data
* the Complexity of the data

The shape of the data refers to its scale and size. The number of columns and rows, combined with the types of variables, will help guide you to a particular set of tools. The complexity of the data refers to the dimensionality of the data collection and your ability to query that data for analysis. Data source(s), connected tables, data types, messiness, and missing data will help you understand which tools are best suited to address one or more of these issues.

After assessing shape and complexity, you will have a more specific sense of what needs to be coded and how that code will integrate into the overarching analytic effort.
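
A first pass at assessing shape and complexity could look like the sketch below. It uses only the standard library so that it runs anywhere; in practice a data analysis library would usually do this work. The sample data and the `infer_type` and `profile` helpers are hypothetical names introduced for illustration.

```python
import csv
import io

# Hypothetical messy sample: mixed types and missing cells.
SAMPLE_CSV = """id,age,city,income
1,34,Austin,52000
2,,Denver,61000.5
3,29,,
4,forty,Boston,48000
"""


def infer_type(values):
    """Guess a column type from its non-missing values."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    # Try the strictest numeric type first, then fall back.
    for label, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return label
        except ValueError:
            continue
    return "text"


def profile(csv_text):
    """Report the shape plus per-column type and missing-value counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    report = {"rows": len(rows), "columns": len(columns), "details": {}}
    for col in columns:
        values = [row[col].strip() for row in rows]
        report["details"][col] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
        }
    return report


report = profile(SAMPLE_CSV)
print(report["rows"], "rows x", report["columns"], "columns")
for name, detail in report["details"].items():
    print(name, detail)
```

Here the `age` column is reported as text because of the stray value `forty`, which is the kind of messiness that tells you cleaning code will be needed before analysis.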

<hr>

## Which language should I use?

<hr>

Python is the focus of this course, but there are similar programming languages built for data analysis that you can choose from:

* R is a language built for statistical computing
* SQL, or Structured Query Language, is a domain-specific language (DSL) built for use with relational databases such as PostgreSQL
* Julia is a language similar to Python and works well for numerical analysis
* Scala is a language built to support a broad array of programming paradigms; it is fast and works well with Spark


<hr>

## What is Python and Python Setup

<hr>

### What is Python?

* "Interpreted high-level programming language" (Wikipedia)
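  * Source code is compiled to bytecode, which the interpreter then executes (just-in-time compilation is available in alternative interpreters such as PyPy)
  * Multiple interpreters (CPython is the default)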
* General-purpose programming
* Created by Guido van Rossum (1991)
* Named after Monty Python

### Why Python?

* Simple to Use!
* Broad Spectrum of Application
* Full Stack Solutions
* Numerous 3rd Party Libraries
* Active, Open Source Community

### Setup - Operating Systems

This is the fork in the road where you decide how to set up your system. If you're looking to use Anaconda, Enthought, or another managed Python data ecosystem, skip ahead to the [Where to get help?](#Where-to-get-help?) section.

* Go here for your respective OS download
  * https://www.python.org/
* Windows
  * load executable installer or ZIP
  * Add to Path!
* Mac OS
  * Python for Mac OS
  * For latest versions, have to download
* Linux
  * Just the best ;)
  * Already have Python and Legacy Python installed

### Setup - Package Management

* PyPI
  * [Python Package Index](https://pypi.python.org/pypi)
  * pip3 package utility
    * May need to install depending on OS
  * Anyone can publish their packages here
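
To see which packages are already installed in the current environment, the standard library can list them. The sketch below prints each one in the `name==version` form used by `pip3`; it only approximates what `pip3 freeze` reports, and the `installed_requirements` helper is a hypothetical name.

```python
from importlib import metadata


def installed_requirements():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


requirements = installed_requirements()
print("\n".join(requirements))
```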

### Setup - Virtual Environments

* Why?
  * Maintain project environment
  * Prevent clutter
    * System packages – Python Standard library
    * Site packages - 3rd party libraries
  * Prevent package version issues
  * Makes sharing your research/projects easier
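
The split between the standard library and third-party site packages can be inspected from Python itself. The sketch below, which assumes a `venv`-style environment, reports whether a virtual environment is active and where each kind of package lives; `in_virtual_environment` is a hypothetical helper name.

```python
import sys
import sysconfig


def in_virtual_environment():
    """True when running inside a venv-style virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


paths = sysconfig.get_paths()
print("virtual environment active:", in_virtual_environment())
print("standard library:", paths["stdlib"])
print("site-packages:", paths["purelib"])
```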

### Setup - Python Virtual Environments

* Python has a [default virtual environment tool](https://docs.python.org/3/library/venv.html)
* To create your project directory with virtual environment
  * `python3 -m venv /path/to/new/virtual/environment`
  * `cd /path/to/new/virtual/environment`
  * `source bin/activate` (on Windows, run `Scripts\activate` instead)
* To create a requirements.txt
  * activate the virtual environment
  * `pip3 freeze > requirements.txt`
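
The same `venv` module can also be called from Python, which is handy when environment creation is part of a script. This is a minimal sketch rather than a replacement for the commands above: it creates the environment in a temporary directory (a stand-in for your real project path), and `create_environment` is a hypothetical helper name.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir):
    """Create a virtual environment, similar to `python3 -m venv env_dir`."""
    # with_pip=False keeps the sketch fast; the command-line tool
    # installs pip into the environment by default.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir)


# A temporary directory stands in for /path/to/new/virtual/environment.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_environment(Path(tmp) / "demo-env")
    config_exists = (env_dir / "pyvenv.cfg").exists()
    print("environment created:", config_exists)
```

The `pyvenv.cfg` file is what marks a directory as a virtual environment, so checking for it is a quick way to confirm that creation succeeded.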

### Setup - Pre-built environments

As stated above, there are other "simplified" options for setting up your Python data analysis environment, including:

* [Anaconda](https://www.anaconda.com/products/individual)/[Miniconda](https://docs.conda.io/en/latest/miniconda.html)
* [Enthought](https://docs.enthought.com/ets/#)
* [SageMaker](https://aws.amazon.com/sagemaker/)
* [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)

Depending on project requirements or personal preference, any of these would work well for data analysis with Python, and in some cases they would be preferred for bigger data sets or more resource-intensive methods.

### Setup - Anaconda Virtual Environments for Python

* Anaconda is a popular data science environment and, for Windows users, may be the simplest option
  * Package management and virtual environments are unique to its system
  * Environments are created using its GUI or at the command line using `conda`
* Setup instructions can be found on their website

### Setup - IDEs

* Python comes with a REPL and IDLE (not really ideal for data analysis)
* You can use Integrated Development Environments (IDEs) and code editors, which provide tooling and many useful features for coding in Python
  * [Jupyter Lab](https://jupyter.org/) (what we'll be using for the course)
    * Jupyter Notebooks
    * IPython
  * [Spyder](https://www.spyder-ide.org/) (RStudio feel for Python)
  * [PyCharm](https://www.jetbrains.com/pycharm/) (has Jupyter support and educational tooling for beginners)
  * [Sublime](https://www.sublimetext.com/) (text editor with plugins built in Python)
  * [Atom](https://atom.io/) (just use VS Code)
  * [Visual Studio Code](https://code.visualstudio.com/) (text editor that seems unstoppable at this point; install the Python extension)
  * [Notepad++](https://notepad-plus-plus.org/) (Windows, but seriously just use VS Code)
  * Any text editor really...
    * Beyond this course, you'll probably use Jupyter Notebooks for most data analysis, but as you branch out into bigger programs, VS Code is recommended
      * IPython/Jupyter and [NeoVIM](https://neovim.io/) is what I REALLY recommend, but for your sanity I have to say Jupyter and VS Code :D


<hr>

## Where to get help?

<hr>

Your experienced colleagues are always a great resource for data analysis tips and guidance. There are also several resources online. 

* Google it all!
* [Stack Overflow](https://stackoverflow.com/) FTW!


<hr>

## Using Peer and Mob Coding Sessions

<hr>

Peer and/or mob coding are great ways to work and learn as a group. In both scenarios, each person takes turns typing while the rest of the group tells them what to write and how to structure the code. This allows more experienced personnel to guide newer team members in a relaxed environment. These sessions generally occur at the beginning of a project or during development of more challenging features, but the decision is ultimately up to two (e.g., peer) or more (e.g., mob) people to coordinate when/where these are most needed.


<hr>

## Using DevSecOps to stay on track

<hr>

Once you know the data and the tools you need for analysis, it is important to have a system for assigning tasks and tracking progress toward their completion. Development, Security, and Operations (DevSecOps) is a set of frameworks used to organize technical tasks in a project. There are several tools a project lead can use for this:

* [Asana](https://asana.com/)
* [Trello](https://trello.com/)
* [Jira](https://www.atlassian.com/software/jira)

This also includes more qualitative tasks such as documentation and task sprints.


<hr>

## Where do I save my code?

<hr>

Last, but certainly not least, you will need to preserve all the elements of your data analysis. This includes the Python code you've produced, the raw data, the processed main data used for analysis, and any documentation on what you did, how, and why. Thankfully, there are ways for us to back up our materials as we go. Version control software (VCS) keeps track of changes to our code and data while also serving as a backup with a record of the analysis process. The technical lead for a project should set this up with clear direction on how to implement it for your contributions.

As of now, GitHub is the most ubiquitous hosting platform for version-controlled code, and for good reason. It not only provides free Git hosting, it also allows you to perform DevSecOps tracking, documentation via Markdown files and wikis, gists for testing code, Jupyter Notebook support, static website creation, and much more. GitHub even offers GUI and CLI tools to make the VCS process easier and more accessible.

### Version Control Software

**Always back up your projects with a VCS**

#### Git is an industry standard version control system (the underlying VCS for the most popular tools)

* GitHub - Go [here](https://guides.github.com/) to get started
* GitLab - Go [here](https://docs.gitlab.com/ee/README.html#getting-started-with-gitlab) to get started
* Bitbucket - Go [here](https://www.atlassian.com/git/tutorials) to get started

If you do not feel like learning the Git CLI at first, you can use a GUI Git Client to clone, modify, and push your code.

* For Mac and Windows, use [GitHub Desktop](https://desktop.github.com/)
* For Linux, use the CLI


<hr>

## Demos!

<hr>

In [None]:
# Python Easter Egg!
import antigravity

In [None]:
# Another Python Easter Egg!
import this

In [None]:
# A simple app to open multiple search tabs on a browser for same query

# From the standard library: open_new_tab opens a URL in a new browser tab,
# and quote_plus makes the query safe to place in a URL
from urllib.parse import quote_plus
from webbrowser import open_new_tab

# Define list of search engine URLs
websites = [
    "https://www.google.com/search?q=",
    "https://duckduckgo.com/?q=",
    "https://search.yahoo.com/search?p=",
    "https://www.bing.com/search?q=",
    "https://www.ask.com/web?q=",
    "https://www.startpage.com/do/dsearch?query="
]

# Define the search query (any search phrase works here)
query = "exploratory data analysis in Python"

# For each URL in the list,
# open a new tab with the search engine URL followed by the encoded query
for website in websites:
    open_new_tab(website + quote_plus(query))


In [None]:
# An app with a function which opens multiple search tabs on a browser for same query

# From the standard library: open_new_tab opens a URL in a new browser tab,
# and quote_plus makes the search phrase safe to place in a URL
from urllib.parse import quote_plus
from webbrowser import open_new_tab

# Define a reusable function named 'search' that takes arguments for a search phrase and engines
def search(search_phrase, search_engines):
    # For each URL in the list
    for engine in search_engines:
        # open a new tab with the search engine URL followed by the encoded phrase
        open_new_tab(engine + quote_plus(search_phrase))


In [None]:
# use our search function for our query
# place your code below



## Resources

* Public/University Libraries
  * Books, E-books, and online education resources
* [Pycoder's Weekly](http://pycoders.com/)
* [KDNuggets](https://www.kdnuggets.com/news/index.html)
* [Data Science Weekly](https://www.datascienceweekly.org/newsletters/data-science-weekly-newsletter-issue-217)
* [DataCamp](https://www.datacamp.com/)
* [Codecademy](https://www.codecademy.com/)
* [Python Packages](https://packaging.python.org/tutorials/installing-packages/)
* [Real Python](https://realpython.com/)
* [Full Stack Python](https://www.fullstackpython.com/)
* [PyCon talks](http://pyvideo.org/)
* [Dan Bader’s stuff](https://dbader.org/)