# Data Science Worksheet

## 1. Ask our question

Which aspects of Python programming most commonly confuse programmers?

## 2. Get the data

Stack Overflow has a lot of relevant data.  For the purposes of today's exercise, let's pretend
that Stack Overflow does not have an API.

So which publicly accessible pages hold that data?

In [5]:
SO_URL = "https://stackoverflow.com/questions/tagged/{}?page={}&sort=frequent&pagesize=50"
# print(SO_URL.format("python", 1))
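The template has two placeholders: the tag and the page number. As a minimal sketch, a small helper can fill them in and percent-encode the tag, so that tags with special characters still produce a valid URL. The helper name `build_url` is hypothetical and not part of the worksheet.

```python
from urllib.parse import quote

SO_URL = "https://stackoverflow.com/questions/tagged/{}?page={}&sort=frequent&pagesize=50"


def build_url(tag, page=1):
    """Fill in the SO_URL template, percent-encoding the tag (hypothetical helper)."""
    return SO_URL.format(quote(tag, safe=""), page)


print(build_url("python", 1))
print(build_url("c++", 2))
```

Looping `page` over a range would give the URLs for further result pages.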

So we want the HTML from `SO_URL`, which is available via the HTTP
protocol that a browser uses.  How do we send that request in Python?

### the `requests` library

In [9]:
import requests

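# Request page 1 of the questions tagged "python".
SO_response = requests.get(SO_URL.format('python', 1), timeout=10)

# Raise an error on a 4xx/5xx response so that SO_page_html
# is never silently left undefined for the cells below.
SO_response.raise_for_status()
SO_page_html = SO_response.text

# print(SO_page_html)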

### Data wrangling

So now we have our raw (primary) data.  It needs to be cleaned and structured.



In [22]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(SO_page_html, "html.parser")

question_summaries = soup.find_all("div", class_='question-summary')

# print(len(question_summaries))
# print(question_summaries[0])

Drill down further: pull the title and the view count out of each question summary.

In [46]:
dataset = []

for summary in question_summaries:
    # print(type(summary))
    question = summary.find('a', class_='question-hyperlink').text
    # The 'title' attribute begins with the view count (with thousands
    # separators); keep only the number.
    views = int(summary.find('div', class_='views')['title'].split(" ")[0].replace(",", ""))

    dataset.append((views, question))

print(dataset)

[(103166, '“Least Astonishment” and the Mutable Default Argument'), (86055, 'How do I test multiple variables against a value?'), (989877, "Understanding Python's slice notation"), (202286, 'Asking the user for input until they give a valid response'), (15088, 'List of lists changes reflected across sublists unexpectedly'), (853640, 'How to clone or copy a list?'), (74704, 'How do I create a variable number of variables?'), (533807, 'How do you split a list into evenly sized chunks?'), (855733, 'How do I pass a variable by reference?'), (280455, 'Remove items from a list while iterating'), (943642, 'Making a flat list out of list of lists in Python'), (5184, 'How to make good reproducible pandas examples'), (403271, 'How can I read inputs as integers?'), (1555307, 'What does the “yield” keyword do?'), (192712, 'Short Description of the Scoping Rules?'), (340544, 'What does ** (double star/asterisk) and * (star/asterisk) do for parameters?'), (2272003, 'Calling an external command in Py
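The view-count expression in the loop packs several steps into one line. The sketch below unpacks them into a helper. It assumes the `title` attribute looks like `"103,166 views"`, which is inferred from the parsing code rather than confirmed. The name `parse_view_count` is hypothetical.

```python
def parse_view_count(title_text):
    """Return the leading integer of text such as '103,166 views'.

    Returns None when the text does not start with a number, so a change
    in the page markup does not crash the whole scrape.
    """
    first_token = title_text.strip().split(" ")[0]
    try:
        return int(first_token.replace(",", ""))
    except ValueError:
        return None


print(parse_view_count("103,166 views"))
```

Returning `None` instead of raising lets the caller decide whether to skip a question or stop.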

In [54]:
dataset2 = []
# Pair every word of a question title with that question's view count.
for views, question in dataset:
    for word in question.split(" "):
        dataset2.append((views, word))

# dataset

# print(dataset2)
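Splitting on spaces alone leaves punctuation attached, so tokens such as `value?` and `“Least` are counted separately from `value` and `least`, and `'=='` is counted separately from `==`. The following is a rough sketch of one possible normalization step. The names `normalize` and `explode` and the exact set of trimmed characters are assumptions, not part of the worksheet.

```python
import string

# Quote characters to peel off first, so that "'=='" becomes "==".
QUOTES = "'\"“”‘’"
# Characters trimmed from the edges of a token afterwards (an assumed set).
EDGE_CHARS = string.punctuation + "“”‘’…"


def normalize(word):
    """Lowercase a token and trim punctuation from its edges.

    Tokens made only of punctuation (such as "==" or "**") are kept,
    because operators are meaningful in this dataset.
    """
    unquoted = word.strip(QUOTES)
    trimmed = unquoted.strip(EDGE_CHARS)
    return trimmed.lower() if trimmed else unquoted


def explode(dataset):
    """Turn (views, question) pairs into (views, word) pairs."""
    rows = []
    for views, question in dataset:
        for word in question.split():
            rows.append((views, normalize(word)))
    return rows


print(explode([(86055, "How do I test multiple variables against a value?")]))
```

Whether to lowercase or to keep operators is a judgment call. Here operators are kept because they may themselves be a source of confusion.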

## 3. Explore the data

In [60]:
import pandas as pd

# Note: the column named "votes" actually holds the view counts scraped above.
df = pd.DataFrame(dataset2, columns=("votes", "word"))

print(df)

      votes           word
0    103166         “Least
1    103166  Astonishment”
2    103166            and
3    103166            the
4    103166        Mutable
5    103166        Default
6    103166       Argument
7     86055            How
8     86055             do
9     86055              I
10    86055           test
11    86055       multiple
12    86055      variables
13    86055        against
14    86055              a
15    86055         value?
16   989877  Understanding
17   989877       Python's
18   989877          slice
19   989877       notation
20   202286         Asking
21   202286            the
22   202286           user
23   202286            for
24   202286          input
25   202286          until
26   202286           they
27   202286           give
28   202286              a
29   202286          valid
..      ...            ...
390   74322             to
391   74322        iterate
392   74322           over
393   74322              a
394   74322           list
3

In [113]:
# Show every row whose word contains the "==" operator.
for index, row in df.iterrows():
    if "==" in row['word']:
        print(index, row['votes'], row['word'])

# df2_by_word = df.groupby(by='word').mean()

# df['votes']  = df['votes'].astype(float)

# print(df.dtypes)

df2 = df.groupby('word', as_index=False).sum()
print(df2)

138 1462424.0 ==
255 9040.0 ==
325 1075055.0 '=='
votes    float64
word      object
dtype: object
                word       votes
0               '=='   1075055.0
1             'eval'     21259.0
2               'is'   1075055.0
3                '…'    349835.0
4                (an     82813.0
5            (double    340544.0
6    (star/asterisk)    340544.0
7                  *    340544.0
8                 **    340544.0
9                  -    349835.0
10                 0    480121.0
11                 3    172003.0
12                7?    713468.0
13                ==   1471464.0
14          Argument    103166.0
15            Asking    202286.0
16     Astonishment”    103166.0
17            Button      9861.0
18           Calling   2272003.0
19        Converting   1473099.0
20           Default    103166.0
21       Description    192712.0
22        Difference    410077.0
23          Division    480121.0
24              Does   1168429.0
25           Flatten     82813.0
26        F
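The grouped table is sorted alphabetically, which makes it hard to see which words carry the most views. As a sketch of the same sum-per-word aggregation in plain Python, the snippet below totals the views for each word, skips a few filler words, and returns the largest totals first. The `STOP_WORDS` set and the name `top_words` are illustrative assumptions.

```python
from collections import Counter

# A tiny, hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "do", "i", "how", "in", "of", "and", "for"}


def top_words(pairs, n=10):
    """Sum the views per word and return the n largest (word, total) pairs."""
    totals = Counter()
    for views, word in pairs:
        if word.lower() not in STOP_WORDS:
            totals[word] += views
    return totals.most_common(n)


sample = [(100, "list"), (50, "a"), (30, "list"), (40, "copy")]
print(top_words(sample, 2))
```

Because the function takes `(views, word)` pairs, it can be called on `dataset2` directly.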

In [119]:
word_frequencies = [tuple(x) for x in df2.values]

In [122]:
print(len(word_frequencies))

247


Now let's think more about the data via visualization:

In [1]:
from wordcloud import WordCloud

ModuleNotFoundError: No module named 'wordcloud'
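The error above means the `wordcloud` package is not installed in this environment. It is published on PyPI under that name, so `pip install wordcloud` should provide it. As a hedged sketch of the next step, the `(word, total)` tuples in `word_frequencies` can be converted to a dictionary and passed to `WordCloud.generate_from_frequencies`. The helper names and the image size are assumptions.

```python
def to_frequency_dict(word_frequencies):
    """Convert (word, total) tuples into the {word: weight} mapping a word cloud needs."""
    return {word: float(total) for word, total in word_frequencies}


def render_cloud(frequencies):
    """Build a word cloud, or return None if the wordcloud package is missing."""
    try:
        from wordcloud import WordCloud
    except ImportError:
        return None
    return WordCloud(width=800, height=400).generate_from_frequencies(frequencies)


frequencies = to_frequency_dict([("==", 1471464.0), ("Calling", 2272003.0)])
print(frequencies)
```

In the notebook, `render_cloud(to_frequency_dict(word_frequencies))` would return the cloud object, which can then be displayed with an image viewer such as matplotlib's `imshow`.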

## 4. Model the data (optional)

## 5. Communicate the data