# Popular Data Science Questions
Our goal in this project is to use Data Science Stack Exchange to determine what content a data company should create, based on how much interest each subject attracts.

# Exploring the Data


In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [15]:
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


In [16]:
questions.count()

Id               8839
CreationDate     8839
Score            8839
ViewCount        8839
Tags             8839
AnswerCount      8839
FavoriteCount    1407
dtype: int64

Only the FavoriteCount column has missing values. Because its values are counts, it could also be stored as int64 rather than float64 once the missing values are filled.
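As a small illustration of why a count column with gaps ends up as float64, here is a sketch on a made-up Series (the name `favorites` and its values are hypothetical, not taken from the dataset). It only assumes that the missing entries are represented as NaN, which is what `read_csv` produces for empty cells.

```python
import numpy as np
import pandas as pd

# Toy stand-in for FavoriteCount; the values are invented for illustration.
favorites = pd.Series([1, np.nan, 3])

# NaN is a floating-point value, so pandas stores the whole column as float64.
print(favorites.dtype)

# Count the missing entries in the column.
n_missing = favorites.isna().sum()
print(n_missing)

# Once the gaps are filled, the column can be converted to an integer type.
filled = favorites.fillna(0).astype(int)
print(filled.tolist())
```

The same pattern (fill, then cast) only makes sense if a missing value can reasonably be read as zero.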

# Cleaning the Data
Fill the missing FavoriteCount values with 0, then convert the column to an integer type:

In [17]:
questions.fillna(value={"FavoriteCount": 0}, inplace=True)
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions.count()

Id               8839
CreationDate     8839
Score            8839
ViewCount        8839
Tags             8839
AnswerCount      8839
FavoriteCount    8839
dtype: int64

In [18]:
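# Make the tags easier to work with: strip the outer "<" and ">", then split on "><".
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default.
questions["Tags"] = questions["Tags"].str.replace("^<|>$", "", regex=True).str.split("><")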
questions.sample(2)

| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|---|---|---|---|---|---|---|---|
| 3216 | 48056 | 2019-03-26 22:07:29 | 1 | 139 | [orange, orange3] | 1 | 1 |
| 7187 | 43699 | 2019-01-09 01:15:04 | 0 | 514 | [machine-learning, neural-network, deep-learni... | 2 | 0 |
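To make the tag cleanup concrete, the following sketch applies the same two string operations to a pair of made-up tag strings. It assumes the raw Tags values look like `<tag1><tag2>`, which is what the replace-and-split pattern implies; `raw_tags` and `parsed` are illustrative names.

```python
import pandas as pd

# Hypothetical raw tag strings in the assumed "<tag1><tag2>" format.
raw_tags = pd.Series(["<machine-learning><python>", "<orange>"])

# Remove the leading "<" and trailing ">", then split on the "><" separators.
parsed = raw_tags.str.replace(r"^<|>$", "", regex=True).str.split("><")

print(parsed.tolist())
# [['machine-learning', 'python'], ['orange']]
```

Each element of the resulting column is a Python list, so it can be iterated over tag by tag.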


# Most Used and Most Viewed Tags

In [24]:
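# Count how many questions use each tag.
tag_freq = {}
for tags in questions["Tags"]:
    for tag in tags:
        tag_freq[tag] = tag_freq.get(tag, 0) + 1

# Turn the counts into a one-column DataFrame indexed by tag.
tag_count = pd.DataFrame.from_dict(tag_freq, orient="index")
tag_count.rename(columns={0: "Count"}, inplace=True)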
tag_count.head()

| | Count |
|---|---|
| multilabel-classification | 92 |
| regression | 347 |
| data-transfer | 1 |
| orange | 64 |
| linux | 5 |
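The counting loop can also be written with pandas built-ins. This is an alternative sketch, not the approach used in the notebook: it runs on a toy frame (`toy` and `tag_count_alt` are hypothetical names) and assumes pandas 0.25 or later, where `Series.explode` is available.

```python
import pandas as pd

# Toy data with a list-valued Tags column, mirroring the cleaned questions frame.
toy = pd.DataFrame(
    {"Tags": [["python", "pandas"], ["python"], ["regression", "python"]]}
)

# explode() gives one row per tag; value_counts() tallies how often each appears.
tag_count_alt = toy["Tags"].explode().value_counts().rename("Count").to_frame()

print(tag_count_alt)
```

Unlike the dictionary version, `value_counts` returns the tags already sorted from most to least frequent.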


In [27]:
most_common = tag_count.sort_values(by="Count").tail(30)
most_common

| | Count |
|---|---|
| feature-engineering | 163 |
| xgboost | 165 |
| pytorch | 175 |
| linear-regression | 175 |
| data-science-model | 186 |
| reinforcement-learning | 203 |
| feature-selection | 209 |
| image-classification | 211 |
| data | 213 |
| data-mining | 217 |


In [None]:
most_common.plot(kind="barh", figsize=(16, 8))

<matplotlib.axes._subplots.AxesSubplot at 0x7fc13a63a7b8>