# Data Science Worksheet

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

## 1. Ask our question

Which aspects of Python programming most commonly confuse programmers?

## 2. Get the data

Stack Overflow has a lot of relevant data. For the purposes of today's exercise, let's pretend
that Stack Overflow does not have an API.

So which publicly accessible pages hold that data?

In [2]:
SO_URL = "https://stackoverflow.com/questions/tagged/{}?page={}&sort=frequent&pagesize=50"
# print(SO_URL.format("python", 1))
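The two `{}` placeholders in `SO_URL` are filled positionally by `str.format`: first the tag, then the page number. As a minimal sketch, the listing URLs for several pages could be built as follows; the helper name `page_urls` is hypothetical and not part of the original worksheet.

```python
SO_URL = "https://stackoverflow.com/questions/tagged/{}?page={}&sort=frequent&pagesize=50"


def page_urls(tag, pages):
    """Return the listing URLs for pages 1..pages of the given tag (hypothetical helper)."""
    return [SO_URL.format(tag, page) for page in range(1, pages + 1)]


urls = page_urls("python", 3)
for url in urls:
    print(url)
```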

So we want the HTML from `SO_URL`, which is available via the HTTP
protocol that a browser uses.  How do we send that request in Python?

### the `requests` library

In [3]:
import requests

SO_response = requests.get(SO_URL.format('python', 1))

# Fail loudly on an HTTP error status instead of leaving SO_page_html undefined.
SO_response.raise_for_status()
SO_page_html = SO_response.text

# print(SO_page_html)
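A slightly more defensive version of the request might set a timeout and raise on HTTP errors rather than silently continuing. The sketch below is one way to do that; `fetch_page` and its `get` parameter are hypothetical names introduced here so the function can also be exercised offline with a stand-in for `requests.get`.

```python
def fetch_page(url, get=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `get` defaults to requests.get; passing another callable with the same
    (url, timeout=...) shape is an assumption made here for offline testing.
    """
    if get is None:
        import requests  # imported lazily so the sketch loads without the package
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises for 4xx/5xx responses
    return response.text
```

With `requests` installed, `fetch_page(SO_URL.format("python", 1))` would stand in for the cell above.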

### Data wrangling

So now we have our raw (primary) data.  It needs to be cleaned and structured.



In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(SO_page_html, "html.parser")

question_summaries = soup.find_all("div", class_='question-summary')

# print(len(question_summaries))
# print(question_summaries[0])

Drill down further: pull the title and the view count out of each question summary.

In [5]:
dataset = []

for summary in question_summaries:
    # print(type(summary))
    question = summary.find('a', class_='question-hyperlink').text
    # Take the first token of the 'title' attribute and drop thousands separators.
    views = int(summary.find('div', class_='views')['title'].split(" ")[0].replace(",", ""))
    dataset.append((views, question))

print(dataset)

[(103221, '“Least Astonishment” and the Mutable Default Argument'), (86148, 'How do I test multiple variables against a value?'), (990680, "Understanding Python's slice notation"), (202566, 'Asking the user for input until they give a valid response'), (15102, 'List of lists changes reflected across sublists unexpectedly'), (854576, 'How to clone or copy a list?'), (74765, 'How do I create a variable number of variables?'), (534278, 'How do you split a list into evenly sized chunks?'), (856361, 'How do I pass a variable by reference?'), (280706, 'Remove items from a list while iterating'), (944774, 'Making a flat list out of list of lists in Python'), (5211, 'How to make good reproducible pandas examples'), (403889, 'How can I read inputs as integers?'), (1556456, 'What does the “yield” keyword do?'), (192858, 'Short Description of the Scoping Rules?'), (340961, 'What does ** (double star/asterisk) and * (star/asterisk) do for parameters?'), (2273641, 'Calling an external command in Py
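The view-count expression in the loop packs three steps into one line: take the first space-separated token, strip the thousands separators, convert to `int`. Pulled out on its own, it might look like the sketch below. The function name `parse_view_count` is hypothetical, and the input format `"103,221 views"` is an assumption inferred from how the original code parses the `title` attribute.

```python
def parse_view_count(title_text):
    """Convert text such as '103,221 views' into the integer 103221."""
    first_token = title_text.split(" ")[0]      # '103,221'
    return int(first_token.replace(",", ""))    # 103221


views = parse_view_count("103,221 views")
print(views)
```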

In [6]:
dataset2 = []
for views, question in dataset:
    # One (views, word) row per space-separated word in the title.
    for word in question.split(" "):
        dataset2.append((views, word))

# dataset

# print(dataset2)
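Splitting on spaces leaves punctuation and case intact, so `value?`, `Python?` and `“Least` stay distinct from their bare forms, and `'=='` is counted separately from `==`. One possible cleaning step is sketched below. The function name `normalize_word` and the exact set of stripped characters are assumptions chosen for this data, not part of the original worksheet.

```python
# Characters stripped from both ends of a token (an illustrative choice).
EDGE_CHARS = "'\"“”‘’()[]?,.:;!"


def normalize_word(word):
    """Lower-case a token and strip surrounding quotes, brackets and punctuation."""
    cleaned = word.strip(EDGE_CHARS).lower()
    # Keep the original token if stripping would leave nothing (e.g. a lone '?').
    return cleaned or word


for raw in ["Python?", "'=='", "“Least", "[duplicate]", "__name__"]:
    print(raw, "->", normalize_word(raw))
```

Operators such as `==` and `**` survive because none of their characters are in `EDGE_CHARS`.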

## 3. Explore the data

In [7]:
import pandas as pd

# Note: the "votes" column actually holds the view counts scraped above.
df = pd.DataFrame(dataset2, columns=("votes", "word"))

print(df)

      votes           word
0    103221         “Least
1    103221  Astonishment”
2    103221            and
3    103221            the
4    103221        Mutable
5    103221        Default
6    103221       Argument
7     86148            How
8     86148             do
9     86148              I
10    86148           test
11    86148       multiple
12    86148      variables
13    86148        against
14    86148              a
15    86148         value?
16   990680  Understanding
17   990680       Python's
18   990680          slice
19   990680       notation
20   202566         Asking
21   202566            the
22   202566           user
23   202566            for
24   202566          input
25   202566          until
26   202566           they
27   202566           give
28   202566              a
29   202566          valid
..      ...            ...
390   74377             to
391   74377        iterate
392   74377           over
393   74377              a
394   74377           list
3

In [8]:
# Show every row whose word contains "==".
for index, row in df.iterrows():
    if "==" in row['word']:
        print(index, row['votes'], row['word'])

# df2_by_word = df.groupby(by='word').mean()

# df['votes']  = df['votes'].astype(float)

# print(df.dtypes)

df2 = df.groupby('word', as_index=False).sum()
print(df2)

138 1464217 ==
255 9051 ==
325 1075852 '=='
                word     votes
0               '=='   1075852
1             'eval'     21276
2               'is'   1075852
3                '…'    350265
4                (an     82846
5            (double    340961
6    (star/asterisk)    340961
7                  *    340961
8                 **    340961
9                  -    350265
10                 0    480488
11                 3    172070
12                7?    714354
13                ==   1473268
14          Argument    103221
15            Asking    202566
16     Astonishment”    103221
17            Button      9869
18           Calling   2273641
19        Converting   1474826
20           Default    103221
21       Description    192858
22        Difference    410380
23          Division    480488
24              Does   1169810
25           Flatten     82846
26        Flattening    144771
27               Get    100493
28               How  10144475
29                 I   554
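The `groupby(...).sum()` step adds up the view counts of every title in which a word appears. For readers new to pandas, the same aggregation can be sketched in plain Python; the function name `total_views_by_word` and the sample rows are hypothetical.

```python
from collections import defaultdict


def total_views_by_word(pairs):
    """Sum the view counts for each word across (views, word) pairs."""
    totals = defaultdict(int)
    for views, word in pairs:
        totals[word] += views
    return dict(totals)


# Made-up rows in the same (views, word) shape as dataset2.
sample = [(100, "list"), (50, "a"), (25, "list")]
totals = total_views_by_word(sample)
print(totals)
```

As with the pandas version, a word that occurs twice in one title contributes that title's views twice.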

In [9]:
word_frequencies = {word: total for word, total in df2.values}

In [10]:
print(word_frequencies)

{"'=='": 1075852, "'eval'": 21276, "'is'": 1075852, "'…'": 350265, '(an': 82846, '(double': 340961, '(star/asterisk)': 340961, '*': 340961, '**': 340961, '-': 350265, '0': 480488, '3': 172070, '7?': 714354, '==': 1473268, 'Argument': 103221, 'Asking': 202566, 'Astonishment”': 103221, 'Button': 9869, 'Calling': 2273641, 'Converting': 1474826, 'Default': 103221, 'Description': 192858, 'Difference': 410380, 'Division': 480488, 'Does': 1169810, 'Flatten': 82846, 'Flattening': 144771, 'Get': 100493, 'How': 10144475, 'I': 5548881, 'Is': 457556, 'List': 15102, 'Making': 944774, 'Mutable': 103221, 'NameError:': 350265, 'Python': 7014624, "Python's": 990680, 'Python?': 1013110, 'Remove': 280706, 'Rules?': 192858, 'Scoping': 192858, 'Short': 192858, 'Static': 823326, 'Syntax': 172070, 'Thread': 457556, 'True?': 9051, 'Unable': 765383, 'Understanding': 990680, 'Using': 2384272, 'What': 4822115, 'Why': 1126089, 'Windows?': 2082862, '[duplicate]': 325892, '__name__': 1464217, '__repr__': 410380, '_
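The largest totals belong to filler words such as `How` and `I`, which say little about what confuses programmers. A minimal sketch of filtering them out and ranking what remains is shown below. The stop-word list and the helper names `drop_stop_words` and `top_n` are assumptions for illustration; the sample values are copied from the dictionary above.

```python
# A small, hand-picked stop-word list (an assumption; extend as needed).
STOP_WORDS = {"how", "do", "i", "a", "the", "to", "in", "of", "and", "what", "is", "for"}


def drop_stop_words(frequencies, stop_words=STOP_WORDS):
    """Return a copy of `frequencies` without the stop words (case-insensitive)."""
    return {w: c for w, c in frequencies.items() if w.lower() not in stop_words}


def top_n(frequencies, n=10):
    """Return the n (word, total) pairs with the highest totals."""
    return sorted(frequencies.items(), key=lambda item: item[1], reverse=True)[:n]


sample = {"How": 10144475, "I": 5548881, "Python": 7014624,
          "Calling": 2273641, "Mutable": 103221}
filtered = drop_stop_words(sample)
print(top_n(filtered, 2))
```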

Now let's think more about the data via visualization:

In [11]:
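import matplotlib.pyplot as plt
# Requires the third-party `wordcloud` package (pip install wordcloud);
# without it, the import below raises ModuleNotFoundError.
from wordcloud import WordCloud
wordcloud = WordCloud().generate_from_frequencies(word_frequencies)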

plt.figure(figsize=(18, 16))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

ModuleNotFoundError: No module named 'wordcloud'

## 4. Model the data (optional)

## 5. Communicate the data