-
Notifications
You must be signed in to change notification settings - Fork 0
/
Popular Data Science Questions.py
424 lines (244 loc) · 13.3 KB
/
Popular Data Science Questions.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
#!/usr/bin/env python
# coding: utf-8
# # Popular Data Science Questions
# by Nicholas Archambault
#
# The goal of this project is to use Data Science Stack Exchange -- a question and answer site for data scientists, machine learning professionals, and others interested in the field -- to determine which content a data science company should create based on the site's most popular subjects.
#
# Most questions on the site are practical, specific to data science, and relevant to other posts. The DSSE homepage divides into four main sections:
# * Questions - a list of all questions asked
# * Tags - a list of labels and keywords that classify questions
# * Users - a list of users
# * Unanswered - a list of unanswered questions
#
# The tagging system utilized by the site seems like a good way to quantify interest in particular subjects and proceed toward our goal of recommending content creation based on topics' popularity.
#
# Upon further examination, we find that exploring tags from the 'Posts' data table, a collection of all posts, will give us an idea of the most popular topics on the site. This information can be used to understand which content should be created in order to most directly appeal to the site's audience.
#
# We can query the posted question data we want from using the site's SQL interface:
#
# `SELECT Id, CreationDate,
# Score, ViewCount, Tags,
# AnswerCount, FavoriteCount
# FROM posts
# WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;`
# ## Exploring the Data
#
# Read in the data, immediately ensuring that `Creation Date` is read as a datetime object. We'll limit our analysis to posts from 2019.
# In[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
# In[2]:
data = pd.read_csv("2019_questions.csv", parse_dates = ["CreationDate"])
# In[3]:
data.head(5)
# In[4]:
data.isnull().sum()
# In[5]:
data.shape
# In[6]:
data.info()
# The only column with missing values is 'FavouriteCount'. Of 8,839 values, 7,432 in that column are missing. Missing values likely mean the given question was not present in any user's favorite list. We can probably replace missing values with 0.
# ## Cleaning the Data
#
# Each listing in the "Tags" column is a string. We can clean these entries and keep them as lists of all the tags that apply to each post.
# In[7]:
# Start with 'FavoriteCount' column
data["FavoriteCount"] = data["FavoriteCount"].fillna(0).astype(int)
# In[8]:
# Continue with 'Tags'
data["Tags"] = data["Tags"].str.replace("><", ",").str.replace("<|>", "").str.split(",")
# ## Tags and Views
#
# To measure tag popularity, count how many times each tag was used and how many times a question with that tag was viewed.
# ### Tags
# In[9]:
# Create dictionary; increment value each time a tag is used
total_tags = {}
for row in data["Tags"]:
for i in row:
if i in total_tags:
total_tags[i] += 1
else:
total_tags[i] = 1
# In[10]:
# Transform into a dataframe for order
tag_count = pd.DataFrame.from_dict(data = total_tags, orient = "index")
# In[11]:
tag_count.rename(columns = {0:"Count"}, inplace = True)
tag_count = tag_count.sort_values(by = "Count")
most_used = tag_count.tail(20)
# The choice to view the 20 most popular tags is arbitrary, but we see that popularity declines rapidly and suspect that we are not neglecting any tags too important to the conclusions we'll draw.
# In[12]:
# Plot popular tags
most_used.plot(kind = "barh", figsize = (16,8))
plt.title("Most Used Tags")
plt.ylabel("Tag")
plt.show()
# Visualizing the data reveals that some tags -- including 'machine-learning' and 'dataset' -- are a bit too broad to be useful for suggesting new, specific content to be created.
#
# Let's see how the trends of overall views compare to the trends observed in tag popularity. We can repeat the same process of accumulating counts, converting to a dataframe, sorting the values, and plotting the results.
# ### Views
# Repeat previous steps.
# In[13]:
tag_views = dict()
for index, row in data.iterrows():
for i in row["Tags"]:
if i in tag_views:
tag_views[i] += row["ViewCount"]
else:
tag_views[i] = row["ViewCount"]
# In[14]:
total_views = pd.DataFrame.from_dict(tag_views, orient = "Index")
# In[15]:
total_views.rename(columns={0:"Views"}, inplace = True)
total_views = total_views.sort_values(by = "Views")
most_viewed = total_views.tail(20)
# In[16]:
most_viewed.plot(kind = "barh", figsize = (16,8))
plt.title("Most Viewed Tags")
plt.ylabel("tag")
plt.show()
# In[17]:
# View Tags and Views plots adjacently
fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches((24, 10))
most_used.plot(kind="barh", ax=axes[0], subplots=True)
most_viewed.plot(kind="barh", ax=axes[1], subplots=True)
plt.show()
# Viewing the plots side-by-side gives us a visual sense of the tags most shared by both dataframes. Merging these dataframes together yields a clearer numerical picture of common tags' occurrences.
# In[18]:
in_used = pd.merge(most_used, most_viewed, how = "left", left_index = True, right_index = True)
in_used
# In[19]:
in_viewed = pd.merge(most_viewed, most_used, how = "left", left_index = True, right_index = True)
in_viewed
# ## Relations Between Tags
#
# One way of gauging the relationship between pairs of tags is to track how often they appear together. We can accomplish this by creating a dataframe where each row and column is a specific tag and all cells are zeros. Each index location will be incremented when the given tag pair to which it corresponds is found in the original `data` object consisting of all posted questions from 2019.
# In[20]:
all_tags = list(tag_count.index)
# In[21]:
# Create new associations dataframe filled with zeros
associations = pd.DataFrame(index = all_tags, columns = all_tags)
associations.fillna(0, inplace = True)
# In[22]:
# Increment for each occurrence of association within the same post between two tags
for tags in data["Tags"]:
associations.loc[tags, tags] += 1
# To help understand what's going on in the dataframe, we can add color.
# In[23]:
relations_most_used = associations.loc[most_used.index, most_used.index]
def style_cells(x):
helper_df = pd.DataFrame('', index=x.index, columns=x.columns)
helper_df.loc["time-series", "r"] = "background-color: yellow"
helper_df.loc["r", "time-series"] = "background-color: yellow"
for k in range(helper_df.shape[0]):
helper_df.iloc[k,k] = "color: blue"
return helper_df
relations_most_used.style.apply(style_cells, axis=None)
# The cells highlighted in yellow tell us that 'time-series' was used together with 'r' 22 times. The values in blue tell us how many times each of the tags was used. We saw earlier that 'machine-learning' was used 2693 times and we confirm it in this dataframe.
# To make the tag-tag associations even clearer, we can utilize Seaborn to create a heatmap. Darker colors correspond to higher values and a stronger association between the usage of two particular tags.
# In[24]:
# Insert diagonal line of null associations between a tag and itself
for i in range(relations_most_used.shape[0]):
relations_most_used.iloc[i,i] = pd.np.NaN
# In[25]:
# Create heatmap
plt.figure(figsize=(12,8))
sns.heatmap(relations_most_used, cmap="Greens", annot=False)
# The most used tags also seem to have the strongest relationships, as given by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used often, and so end up being used together without any truly strong relation between them.
#
# Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between multiple groups of tags. For example, it could be the case that when used together, 'dataset' and 'scikit-learn' have a "strong" relation to 'pandas', but each by itself doesn't.
#
# Keras, scikit-learn, and TensorFlow are all Python libraries that allow their users to employ deep learning. Most of the top tags, including these three, are all intimately related with the theme of deep learning. If we want to be very specific, we can suggest that the most popular newly created content would be Python content that uses deep learning for classification problems.
# ## Tracking Deep Learning Questions
#
# In this final section, we'll build on the conclusion that most top tags are related to deep learning by charting the popularity of deep learning across multiple years. The results of this analysis will provide concrete insight about whether or not to recommend that deep learning and related themes be the focus of new content created and promoted on Data Science Stack Exchange.
#
# We'll start by reading in and cleaning the tags of a dataset containing all questions, not just those posted in 2019.
# In[26]:
# Read data and parse dates
all_q = pd.read_csv("all_questions.csv", parse_dates = ["CreationDate"])
# In[27]:
# Clean Tags column
all_q["Tags"] = all_q["Tags"].str.replace("><", ",").str.replace("<|>", "").str.split(",")
# To visualize the popularity of deep learning-related tags over time, we'll chart the number of posts containing these themes quarterly from Q2 of 2014 to Q4 of 2019.
#
# The following steps will increment a new binary 'DeepLearning' column in the dataframe `all_q` for all posts containing at least one deep learning-related tag, shown in the list below.
# In[28]:
dl_list = ["cnn", "lstm", "scikit-learn", "tensorflow", "keras", "neural-network", "deep-learning"]
# In[29]:
# Function that increments a binary variable for each post containing a deep learning tag
def dl_classification(tags):
for i in tags:
if i in dl_list:
return 1
return 0
# In[30]:
all_q["DeepLearning"] = all_q["Tags"].apply(dl_classification)
# In[31]:
all_q.sample(5)
# We find that the process has worked as expected; posts containing one of the specified deep learning tags are now marked as such.
#
# Since we don't have complete data for 2020, we'll filter out all posts from that year before writing a function that creates an `all_q` column denoting the quarter in which a post was composed. This column, along with the year of the post, will segment the data and prepare it for quarterly tracking.
# In[32]:
# Remove 2020 data
all_q = all_q[all_q["CreationDate"].dt.year < 2020]
# In[33]:
# Define quarter-assigning function
def quarter(datetime):
year = str(datetime.year)[-2:] # Strip year from date
month = int(datetime.month) # Strip month
# Assign post to particular quarter based on month
if month < 4:
quarter = 1
elif 4 <= month < 7:
quarter = 2
elif 7<= month < 10:
quarter = 3
else:
quarter = 4
# Return newly formatted year and quarter
return "{y}Q{q}".format(y=year, q=quarter)
all_q["Quarter"] = all_q["CreationDate"].apply(quarter)
# Armed with all the tools necessary to formulate our final analysis, we can group the `quarterly` data by quarter, chart the number of deep learning posts versus total posts, and calculate the ratio of these values.
# In[34]:
# Group data by quarter
quarterly = all_q.groupby("Quarter").agg({"DeepLearning":["sum", "size"]})
# Create new columns
quarterly.columns = ["DeepLearningQuestions", "TotalQuestions"]
# Find ratio of deep learning posts to total posts
quarterly["Ratio"] = quarterly["DeepLearningQuestions"] / quarterly["TotalQuestions"]
quarterly.reset_index(inplace = True)
# In[35]:
quarterly
# In[36]:
# Plot data
ax1 = quarterly.plot(x="Quarter", y="Ratio",
kind="line", linestyle="-", marker="o", color="orange",
figsize=(24,12)
)
ax2 = quarterly.plot(x="Quarter", y="TotalQuestions",
kind="bar", ax=ax1, secondary_y=True, alpha=0.7, rot=45, color = "gray")
for idx, t in quarterly["TotalQuestions"].iteritems():
ax2.text(idx, t, str(t), ha="center", va="bottom")
xlims = ax1.get_xlim()
ax1.get_legend().remove()
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles=handles1 + handles2,
labels=labels1 + labels2,
loc="upper left", prop={"size": 12})
for ax in (ax1, ax2):
for where in ("top", "right"):
ax.spines[where].set_visible(False)
ax.tick_params(right=False, labelright=False)
plt.show()
# Plotting the per-quarter post totals and ratios of deep learning posts to total posts, we see that since 2014 deep learning has been a high-growth trend on Data Science Stack Exchange that is now starting to plateau. There's no evidence to suggest that on-site interest in it is waning, so we should proceed with our recommendation that newly created content for Data Science Stack Exchange focus on topics in deep learning.
# ## Conclusion
#
# The goal of this project was to explore the question and answer format of Data Science Stack Exchange and make a recommendation for the focus of new content based on the most popular trends and topics on the site. By exploring the tags under which posts are classified, we determined that the site features substantial collective interest in topics related to deep learning, a trend that has grown and stabilized over the course of the past 5+ years.