# Popular Data Science Questions

In this project, we're going to explore the Data Science Stack Exchange (DSSE) and answer popular data science questions that may arise on the job. DSSE is part of Stack Exchange, a network of question-and-answer websites covering topics in diverse fields. All the websites in the network are modeled after the original site, Stack Overflow, which is very popular in the programming community.

We're tasked with finding out "what is the best kind of content to write about". To answer this question, we're going to use the Data Science Stack Exchange's database, which Stack Exchange makes publicly available. The following columns from the database will be useful for our analysis:

- Id: identification number for a post
- PostTypeId: identification number for the type of post
- Score: a post's score
- ViewCount: how many times a post has been viewed
- Tags: topics that the post is related to
- AnswerCount: how many times a post/question was answered
- FavoriteCount: how many times a post/question was favorited
- CreationDate: date and time a post was created

# Import pandas & read in csv file

In [14]:
import pandas as pd
# read csv into a DataFrame and parse dates by 'CreationDate'
file = pd.read_csv('2019_questions.csv', parse_dates=['CreationDate'])

# Explore the dataset

In [15]:
# view all columns in dataset
file.columns

Index(['Id', 'CreationDate', 'Score', 'ViewCount', 'Tags', 'AnswerCount',
       'FavoriteCount'],
      dtype='object')

In [16]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


In [17]:
file.isnull().sum()

Id                  0
CreationDate        0
Score               0
ViewCount           0
Tags                0
AnswerCount         0
FavoriteCount    7432
dtype: int64

Above we can see that there are 7,432 null values in the FavoriteCount column. Assuming a missing value means the question was never favorited, we will fill all the missing values in the FavoriteCount column with 0 and convert the entire column to integers.

# Fill missing values and convert columns

In [18]:
# fill NaN values in FavoriteCount column
file['FavoriteCount'] = file['FavoriteCount'].fillna(0)

In [19]:
# convert FavoriteCount to type int
file['FavoriteCount'] = file['FavoriteCount'].astype(int)

In [22]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    8839 non-null int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB
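
As a side note on why FavoriteCount was read in as float64 in the first place: NaN is itself a floating-point value, so pandas stores a numeric column with missing entries as floats by default. The sketch below reproduces the fill-and-convert steps on a small made-up Series (the values are hypothetical, not taken from the dataset) and chains the two calls into one expression.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the FavoriteCount column; these values are made up.
favorites = pd.Series([2.0, np.nan, 1.0, np.nan], name="FavoriteCount")

# The NaN entries force the column to a float dtype.
print(favorites.dtype)  # float64

# Fill the gaps with 0, then cast to integers in one chained step.
cleaned = favorites.fillna(0).astype(int)
print(cleaned.tolist())  # [2, 0, 1, 0]
```

Chaining is only a stylistic choice; the two separate cells above do the same thing.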


# Tags column

There is a post on Meta Stack Exchange about how to correctly tag a question:

https://meta.stackexchange.com/questions/18878/how-do-i-correctly-tag-my-questions/18879#18879

Within the very detailed answer, we find out that:

"You are limited to five tags, and you are generally better off trying to use as many of them as you can, provided they follow the guidelines here."

Because of this limit, we'll create five columns in the dataset labeled tag1, tag2, tag3, tag4, and tag5, which will make the tags easier to work with. Each tag is also wrapped in angle brackets (for example, <python>), so we'll pass a regular expression into str.replace, which will clean this up by replacing the brackets with an empty string.

In [25]:
file['Tags'].head(3)

0                      <machine-learning><data-mining>
1    <machine-learning><regression><linear-regressi...
2         <python><time-series><forecast><forecasting>
Name: Tags, dtype: object
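
The tag clean-up described above could be sketched as follows. This is one possible approach, not necessarily the exact code used in the rest of the analysis: it removes only the outer angle brackets with a regular expression, so that the inner `><` is left as a separator to split on. The two sample rows are copied from the output above, and the padding to five columns via `reindex` is an assumption made here so that tag1 through tag5 always exist.

```python
import pandas as pd

# Two sample rows copied from the Tags output above.
tags = pd.Series(
    [
        "<machine-learning><data-mining>",
        "<python><time-series><forecast><forecasting>",
    ],
    name="Tags",
)

# Remove the leading "<" and trailing ">" with a regular expression,
# leaving "><" between neighbouring tags.
stripped = tags.str.replace(r"^<|>$", "", regex=True)

# Split on "><" into one column per tag, then pad to exactly five columns.
tag_columns = stripped.str.split("><", expand=True).reindex(columns=range(5))
tag_columns.columns = ["tag1", "tag2", "tag3", "tag4", "tag5"]

print(tag_columns)
```

Questions with fewer than five tags end up with missing values in the unused columns, which would need to be handled before counting tags.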