# Using Data Science To Identify Confusion Amongst Python Programmers 

Today's Goal: Get started with data science workflows and tooling in a real-life scenario.

## It's a Typical Monday At Work And....

Harriet Human-Resources, the VP of Training and Interviewing, comes in and says:

> We need to figure out a better way to assess candidates for skills in particular programming languages.  We also need to make our internal training programs for teaching employees better.  The CTO says we're going to build most things in Python. 

> We’re going to put a team on these problems, but they need more information about how to .  I know you’re busy with 100 other things, but can you give us some insight at Friday’s meeting? 

> Maybe, after the meeting, we can have a talk about that raise you asked for.

## What do you do?

For starters, you're going to set up this Jupyter notebook to keep track of your work.  Hit `ESC` + `h` to see the Jupyter Help menu.

### Setting up the Worksheet

Put these at the top of every notebook to get [automatic reloading](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html) of imported modules and inline plotting:

In [2]:
%reload_ext autoreload 
%autoreload 2
%matplotlib inline 

## A Workflow For Thinking About the Problem And The Solution

1. Identify problem area.
2. Collect and clean the data.
3. Explore the data.
4. Model data (optional).
5. Communicate and visualize results.
6. Repeat various steps until you and/or a domain expert is satisfied.

## Step 1: Identify The Question
----------------------------------------

In which areas do Python programmers struggle the most?  

## Step 2: Identify and collect the data
------------------------------------------------

Stack Overflow has a lot of excellent data.  

Say that you don't want to use their API for some reason.  So instead you're going to scrape their pages.  Go find the best pages for Python information and scrape them.

In [None]:
STACK_OVERFLOW_URL = "https://stackoverflow.com/questions/tagged/python?sort=votes&pageSize=15"

This is a rather small batch of data, but it's good to have a small batch to experiment with.

In [None]:
# STACK_OVERFLOW_URL = "https://stackoverflow.com/questions/tagged/python?page=1&sort=frequent&pagesize=50"
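The two URLs above differ only in their query parameters. As a minimal sketch, the query string can be assembled with the standard library's `urllib.parse.urlencode` instead of being typed by hand. The helper name `build_tagged_url` and its defaults are hypothetical; the parameter names (`page`, `sort`, `pagesize`) are simply the ones that appear in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://stackoverflow.com/questions/tagged/python"


def build_tagged_url(page=1, sort="votes", pagesize=15):
    """Return the tagged-questions URL with the given query parameters.

    Hypothetical helper: the parameter names are copied from the URLs in this notebook.
    """
    query = urlencode({"page": page, "sort": sort, "pagesize": pagesize})
    return f"{BASE_URL}?{query}"


small_batch_url = build_tagged_url()
larger_batch_url = build_tagged_url(sort="frequent", pagesize=50)
print(small_batch_url)
print(larger_batch_url)
```

Building the URL from a dictionary makes it easier to loop over `page` later if you decide to collect more than one page.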

### The `requests` library

In [None]:
import requests

[Requests](http://docs.python-requests.org/en/master/) is a library that wraps HTTP in a pleasant Python API, so requests that would otherwise be cumbersome to write become a line or two of readable code.  Fetching a webpage takes a single call:

In [None]:
stack_overflow_response = requests.get(STACK_OVERFLOW_URL)

stack_overflow_html = stack_overflow_response.content

# print(stack_overflow_html)
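The cell above assumes the request succeeds. A slightly more defensive sketch checks the HTTP status and sets a timeout before handing back the page text. `fetch_html` is a hypothetical helper and the 10-second timeout is an arbitrary choice; `raise_for_status()` and the `timeout` argument are standard parts of Requests. The `get` parameter exists only so the function can be exercised without a network connection.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Fetch a page and return its decoded text, raising on HTTP errors."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Usage (needs network access):
# stack_overflow_html = fetch_html(STACK_OVERFLOW_URL)
```

Note that `response.text` is a decoded `str`, while `response.content` is raw `bytes`. Beautiful Soup accepts either one.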

However, this data is just a mass of text at this point.  It needs to be parsed into a usable data structure....

## Step 2 (continued): Parse and Clean the Data
----------------------------------

### The `BeautifulSoup` library

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is one of the most popular HTML parsing libraries in the Python package ecosystem. You tell it which parsing engine to use and feed in HTML text, and its top-level object gives you a navigable data structure (a Python version of the DOM).

In [None]:
from bs4 import BeautifulSoup
PARSER = 'html.parser'

soup = BeautifulSoup(stack_overflow_html, PARSER)

# questions = soup.find_all("a", "question-hyperlink")
questions = soup.find_all("div", class_="question-summary")

# print(questions)
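To see what `find_all` and `find` return without touching the network, here is a small sketch that parses a hand-written HTML fragment. The class names mirror the ones used in the cell above, but the markup itself is made up for illustration and may not match what Stack Overflow actually serves.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; real Stack Overflow markup may differ.
SAMPLE_HTML = """
<div class="question-summary">
  <a class="question-hyperlink" href="/q/1">What does yield do?</a>
</div>
<div class="question-summary">
  <a class="question-hyperlink" href="/q/2">What are metaclasses?</a>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class
sample_questions = sample_soup.find_all("div", class_="question-summary")

# find returns the first matching descendant; .text is its text content
sample_titles = [q.find(class_="question-hyperlink").text for q in sample_questions]
print(sample_titles)
```

Experimenting on a tiny fragment like this is a quick way to check your selectors before running them against a full page.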

OK, so now we have a list-like collection of Beautiful Soup `Tag` objects:

In [None]:
print('TYPE: ', type(questions[0]))

Now use the `text` attribute of each Beautiful Soup `Tag` object to get the meat of the data that you want:

In [None]:
dataset = []

for question in questions:
    # The question title is the text of the hyperlink element
    text = question.find(class_='question-hyperlink').text

    # The vote count is the first token of the title attribute;
    # strip thousands separators before converting to int
    votes_text = question.find(class_='supernova')['title'].split(" ")[0]
    votes = int(votes_text.replace(',', ''))

    # Emit one (word, votes) row per word in the title
    for word in text.split(" "):
        dataset.append((word, votes))

print(dataset)
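The loop above does two separate jobs: it pulls a vote count out of a `title` attribute, and it fans each question title out into one row per word. The sketch below isolates those two steps so they can be tried on their own. It assumes the attribute looks like `"1,234 votes"`, which is inferred from how the loop splits the string and has not been checked against the live site. Both helper names are hypothetical.

```python
def parse_votes(title_attr):
    """Turn a string such as '1,234 votes' into the integer 1234.

    The input format is an assumption inferred from the parsing loop.
    """
    first_token = title_attr.split(" ")[0]
    return int(first_token.replace(",", ""))


def rows_for_question(title, votes):
    """Return one (word, votes) tuple per whitespace-separated word in the title."""
    return [(word, votes) for word in title.split()]


sample_rows = rows_for_question("How do I merge two dictionaries?", parse_votes("1,234 votes"))
print(sample_rows)
```

One small design difference: `title.split()` with no argument collapses runs of whitespace, so it never produces empty-string "words" the way `split(" ")` can.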

Normally I would extract 100s of pages of this data and put it in separate files.

But today, we're just looking for an in-memory example.  We will instead keep it in a Python list.

In [None]:
# Use this code if I want to switch between files.

# with open('data.txt', 'r') as f:
#     res = f.readlines()

# res = [x.split(" ") for x in [y.strip() for y in res]]

# flattened_res = [item for sublist in res for item in sublist]
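For the file-based workflow mentioned above, one possible sketch uses the standard library's `csv` module to write the `(word, votes)` rows to disk and read them back. The helper names and the file name are hypothetical, the sample rows are made-up values, and a temporary directory is used so that nothing is left behind.

```python
import csv
import os
import tempfile


def save_rows(rows, path):
    """Write (word, votes) rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "votes"])
        writer.writerows(rows)


def load_rows(path):
    """Read the rows back, converting the vote counts to integers again."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        return [(word, int(votes)) for word, votes in reader]


example_rows = [("decorators", 1200), ("yield", 950)]  # made-up values
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "page_001.csv")  # hypothetical file name
    save_rows(example_rows, example_path)
    reloaded_rows = load_rows(example_path)

print(reloaded_rows)
```

Writing one file per scraped page keeps the scraping and the analysis separate, so you can rerun the analysis without hitting the website again.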


## Step 3: Exploratory Data Analysis


### `ndarray` -- n-dimensions of your data.

So this data looks like it has two columns: each word from a question title, and the number of votes that question received. Let's separate that out with some simple Python.

In [None]:
import numpy as np

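# A structured dtype stores each row as a (word, votes) record;
# 'U50' is a Unicode string of up to 50 characters, 'i8' a 64-bit integer
topics = np.array(dataset, dtype=[('word', 'U50'), ('votes', 'i8')])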

print(topics)
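Because the rows mix strings and integers, one way to hold them in a single `ndarray` is a NumPy structured array, where each field has a name and its own type. The sketch below uses made-up rows; the field names and the `'U32'` string width are arbitrary choices for illustration.

```python
import numpy as np

# Made-up rows in the same (word, votes) shape as the scraped dataset
sample_dataset = [("yield", 100), ("metaclasses", 50), ("yield", 300)]

# A structured dtype gives each column a name and its own type.
# 'U32' means "Unicode string of at most 32 characters" (longer words are truncated).
sample_topics = np.array(sample_dataset, dtype=[("word", "U32"), ("votes", "i8")])

print(sample_topics.shape)            # one axis: one record per row
print(sample_topics["word"])          # all words
print(sample_topics["votes"].mean())  # average vote count

# Boolean masks work on fields, e.g. the votes for rows whose word is "yield"
yield_votes = sample_topics["votes"][sample_topics["word"] == "yield"]
print(yield_votes)
```

Indexing by field name returns an ordinary typed array, so the usual NumPy arithmetic applies to the vote counts.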

### Pandas DataFrames

In [None]:
import pandas as pd

df = pd.DataFrame(dataset, columns=["word", "votes"])
print(df)

In [None]:
# Average the vote counts of all questions in which each word appears
df2 = df.groupby(by='word').mean()

In [None]:
print(df2)
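A mean on its own hides how often each word appeared: a word seen once in a single popular question can outrank a word seen many times. As a sketch on made-up rows, `agg` can compute the mean and the count per word together:

```python
import pandas as pd

# Made-up rows in the same (word, votes) shape as the scraped dataset
sample_df = pd.DataFrame(
    [("yield", 100), ("metaclasses", 50), ("yield", 300)],
    columns=["word", "votes"],
)

# Mean votes and number of occurrences per word, computed together
summary = sample_df.groupby("word")["votes"].agg(["mean", "count"])
print(summary)
```

Having both columns side by side makes it easier to decide whether a high average reflects a recurring topic or a single outlier.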

## Step 4: Model The Data
---------------------------------

## Step 5: Visualize and Communicate The Data
------------------------------------------------------------

In [None]:
words1 = df2[0:10]
print(words1)
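The slice `df2[0:10]` takes the first ten rows in index order, which after a `groupby` is alphabetical by word. If the goal is the ten highest-scoring words instead, one option is to rank before slicing. The sketch below uses made-up averages and `DataFrame.nlargest`:

```python
import pandas as pd

# Made-up per-word averages, indexed by word like the grouped frame above
sample_means = pd.DataFrame(
    {"votes": [50.0, 200.0, 120.0]},
    index=pd.Index(["metaclasses", "yield", "decorators"], name="word"),
)

# nlargest keeps the n rows with the highest values in the given column
top_words = sample_means.nlargest(2, "votes")

# A plain {word: score} dict is a convenient shape for plotting libraries
top_frequencies = top_words["votes"].to_dict()
print(top_frequencies)
```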



In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Map each word to its mean vote count; the word cloud sizes words by these values
d = words1['votes'].to_dict()

wordcloud = WordCloud()
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

## Step 6: Repeat As Necessary

## Appendix A: Resources