# Popular Data Science Questions

In this guided project, we explore the Data Science Stack Exchange (DSSE) website to pinpoint the most popular topics being discussed.

First, we query the DSSE database for all posts created in 2019 and export the result to a CSV file containing the Id, CreationDate, Score, ViewCount, Tags, AnswerCount, and FavoriteCount columns. The following is the SQL query we used:

SELECT Id, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM posts
WHERE CreationDate >= '2019-01-01' AND CreationDate < '2020-01-01'
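
When a timestamp column is compared with a date-only bound such as '2019-12-31', the bound is read as midnight at the start of that day, which is why a half-open range (from the start of the year up to, but not including, the start of the next year) is the usual way to select a whole year. The sketch below reproduces the effect in pandas; the three timestamps are made up for illustration and do not come from the real export.

```python
import pandas as pd

# Made-up timestamps for illustration only; not rows from the real export.
stamps = pd.Series(pd.to_datetime([
    "2019-01-01 00:00:00",
    "2019-12-31 18:30:00",  # last day of 2019, but after midnight
    "2020-01-01 00:00:00",
]))

start = pd.Timestamp("2019-01-01")

# Closed range with a date-only upper bound: the bound is 2019-12-31 00:00:00,
# so the 18:30 post on 31 December is excluded.
closed = stamps[(stamps >= start) & (stamps <= pd.Timestamp("2019-12-31"))]

# Half-open range: everything from 1 January 2019 up to, but not including,
# 1 January 2020.
half_open = stamps[(stamps >= start) & (stamps < pd.Timestamp("2020-01-01"))]

print(len(closed), len(half_open))
```

The closed range keeps one of the two 2019 timestamps, while the half-open range keeps both.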

### Importing Tools and Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])

### Exploring the Data

- The column 'FavoriteCount' is mostly empty: only 1,407 of its 8,839 values are non-null, so about 84% are null values.
- We can fill in these null values with 0.
- According to the .info() method, the datatypes correspond correctly to the columns; 'FavoriteCount' is a float only because it contains null values.
- We can remove the angle brackets from the Tags column and store each row's tags as a tuple. Another option is to remove the brackets and separate the tags with a delimiter such as ",".
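
The share of null values per column can be computed directly rather than read off the .info() output. The following is a minimal sketch on a small, hypothetical stand-in DataFrame (`sample` is not the real dataset); on the real data the same call would be applied to `questions`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 5 rows, 4 of them missing
# a FavoriteCount value.
sample = pd.DataFrame({
    "Score": [1, 0, 2, 0, 0],
    "FavoriteCount": [np.nan, np.nan, 1.0, np.nan, np.nan],
})

# isna() marks missing cells with True; the mean of a boolean column
# is the fraction of True values, i.e. the share of nulls.
null_share = sample.isna().mean()
print(null_share)
```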


In [7]:
questions.head()

,Id,CreationDate,Score,ViewCount,Tags,AnswerCount,FavoriteCount
0,44419,2019-01-23 09:21:13,1,21,<machine-learning><data-mining>,0,
1,44420,2019-01-23 09:34:01,0,25,<machine-learning><regression><linear-regressi...,0,
2,44423,2019-01-23 09:58:41,2,1651,<python><time-series><forecast><forecasting>,0,
3,44427,2019-01-23 10:57:09,0,55,<machine-learning><scikit-learn><pca>,1,
4,44428,2019-01-23 11:02:15,0,19,<dataset><bigdata><data><speech-to-text>,0,


In [6]:
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


### Cleaning the Dataset

In [10]:
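# Replace NaN values with 0 in the FavoriteCount column.
# Assigning the result back avoids calling fillna(inplace=True) on a
# selected column, which newer pandas versions warn about.
questions["FavoriteCount"] = questions["FavoriteCount"].fillna(0)
questions.head()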

,Id,CreationDate,Score,ViewCount,Tags,AnswerCount,FavoriteCount
0,44419,2019-01-23 09:21:13,1,21,<machine-learning><data-mining>,0,0.0
1,44420,2019-01-23 09:34:01,0,25,<machine-learning><regression><linear-regressi...,0,0.0
2,44423,2019-01-23 09:58:41,2,1651,<python><time-series><forecast><forecasting>,0,0.0
3,44427,2019-01-23 10:57:09,0,55,<machine-learning><scikit-learn><pca>,1,0.0
4,44428,2019-01-23 11:02:15,0,19,<dataset><bigdata><data><speech-to-text>,0,0.0


In [14]:
# Convert the FavoriteCount column from float to int
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    8839 non-null int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB
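
The order of the two cleaning steps matters: a column that contains NaN is stored as float, and it can only be cast to a plain integer type once the NaN values are gone. The following is a small sketch on a hypothetical stand-in Series (`favorites` is illustrative, not the real column).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the FavoriteCount column.
favorites = pd.Series([np.nan, 3, np.nan, 1], name="FavoriteCount")

# NaN is a float value, so pandas stores the whole column as float64.
print(favorites.dtype)

# Fill the gaps first, then cast; casting before filling would raise an
# error because NaN has no integer representation.
cleaned = favorites.fillna(0).astype(int)
print(cleaned.tolist())
```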


Next, we remove the "<" and ">" characters from the Tags column and use "," as the delimiter between individual tags.

In [17]:
# Drop the outer "<" and ">", then turn each inner "><" into a comma
questions["Tags"] = questions["Tags"].str.strip("<>").str.replace("><", ",", regex=False)
questions["Tags"].head()

0                         machine-learning,data-mining
1    machine-learning,regression,linear-regression,...
2              python,time-series,forecast,forecasting
3                    machine-learning,scikit-learn,pca
4                  dataset,bigdata,data,speech-to-text
Name: Tags, dtype: object
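
With a single delimiter in place, the Tags column can be split back into individual tags and counted. The following is a hedged sketch of one way to do that, assuming a pandas version that provides `Series.explode` (0.25 or later); the `tags` Series is a small sample, not the full dataset.

```python
import pandas as pd

# Small sample in the cleaned, comma-delimited format.
tags = pd.Series([
    "machine-learning,data-mining",
    "machine-learning,scikit-learn,pca",
    "python,time-series",
], name="Tags")

# Split each string into a list, give every tag its own row,
# then count how often each tag occurs.
tag_counts = tags.str.split(",").explode().value_counts()
print(tag_counts.head())
```

In this sample, "machine-learning" appears twice and every other tag once.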