# Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a site where user-submitted stories (known as `"posts"`) are voted and commented upon.

We're specifically interested in posts whose titles begin with either **Ask HN** or **Show HN**. Users submit `Ask HN` posts to ask the Hacker News community a specific question.

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts (i.e., `Ask HN` posts and `Show HN` posts) to determine the following:

1. Which type of post is most popular?

2. Do `Ask HN` or `Show HN` posts receive more comments on average? In other words, which type draws more responses?

3. Do posts created at a certain time receive more comments on average? In other words, is there a best time to post in order to get more comments?

[Modin](https://modin.readthedocs.io/en/latest/) is a fast library in which about 73% of functions work the same as in pandas.

In [None]:
!pip install "modin[ray]"
import numpy as np
import time
import modin.pandas as pd
import matplotlib.pyplot as plt

In [None]:
%%time
news=pd.read_csv("news_posts.csv")
news.head()



In [None]:
# Switch back to plain pandas for the rest of the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
%%time
news=pd.read_csv("news_posts.csv")
news.head()

In [None]:
news.count()

In [None]:
news.columns

In [None]:
news.info()

In [None]:
news.describe()

In [None]:
news.index

In [None]:
news.shape

**We need to convert the `created_at` column to datetime format so that we can compare times.**

# Method 1: convert the dates while reading the file

In [None]:
news=pd.read_csv("news_posts.csv",parse_dates=["created_at"])
news.head()

In [None]:
news.info()

# Method 2: convert the dates at a later stage, after loading

In [None]:
news=pd.read_csv("news_posts.csv")
news["created_at"]=pd.to_datetime(news["created_at"])
news.head()

In [None]:
news.info()
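
As a quick check that the two methods lead to the same result, here is a minimal sketch on a tiny in-memory CSV. The sample rows and the names `csv_text`, `on_read`, and `later` are made up for illustration and are not taken from `news_posts.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for news_posts.csv
csv_text = "id,created_at\n1,2016-08-04 11:52:00\n2,2016-01-26 19:30:00\n"

# Method 1: parse the column while reading
on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_at"])

# Method 2: read as text first, then convert
later = pd.read_csv(io.StringIO(csv_text))
later["created_at"] = pd.to_datetime(later["created_at"])

# Both columns are now datetime columns, so the .dt accessor works on each
print(on_read["created_at"].dt.hour.tolist())
print(later["created_at"].dt.hour.tolist())
```

Either way, the `.dt` accessor becomes available, which the hour analysis later in the notebook relies on.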

In [None]:
# Reorder the columns so that the date comes first
news=news[['created_at','id', 'title', 'url', 'num_points', 'num_comments', 'author']]
news

**We need to build separate data frames for Ask HN posts, Show HN posts, and all other posts to find which type is most popular.**

* `Ask HN` and `Show HN` appear at the start of the `title` column.
* The match must be case-insensitive, because the prefix may be written in lower or upper case.

In [None]:
news["title"].head(15)
# Titles of interest start with Ask HN or Show HN

In [None]:
# Boolean masks: True where the lowercased title starts with the prefix
ask_bool=news["title"].str.lower().str.startswith("ask hn")
show_bool=news["title"].str.lower().str.startswith("show hn")

In [None]:
ask_bool.head()

In [None]:
show_bool.head()

In [None]:
ask_post=news[ask_bool]
show_post=news[show_bool]

In [None]:
ask_post.head()

In [None]:
show_post.head()

In [None]:
# The tilde (~) is the element-wise NOT operator in pandas; plain Python uses "not"
other_news=news[~(ask_bool | show_bool)]
other_news.head()
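
The masking steps above can be sketched on a handful of made-up titles. The `titles` Series and its contents are hypothetical; only the technique (lowercase, `startswith`, then `~` to negate) mirrors the notebook.

```python
import pandas as pd

# Hypothetical titles with mixed capitalisation
titles = pd.Series([
    "Ask HN: How do you learn a new language?",
    "show hn: My weekend side project",
    "Python 3.6 released",
    "ASK HN: Best laptop for developers?",
])

# Lowercase first so the prefix match is case-insensitive
is_ask = titles.str.lower().str.startswith("ask hn")
is_show = titles.str.lower().str.startswith("show hn")

# ~ flips every True/False in the combined mask
others = titles[~(is_ask | is_show)]

print(is_ask.tolist())
print(others.tolist())
```

The parentheses around `is_ask | is_show` matter, because `~` binds more tightly than `|`.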

# Find the average number of comments on Ask HN and Show HN posts

In [None]:
ask_post["num_comments"]

In [None]:
show_post["num_comments"]

In [None]:
ask_post_com=ask_post["num_comments"].mean()
show_post_com=show_post["num_comments"].mean()

In [None]:
print(f"ASK:{ask_post_com} , SHOW:{show_post_com}")
# On average, Ask HN posts get more comments, so an Ask HN post is the better choice when the goal is more comments


# Making a Series to show the data

In [None]:
avg_comment=pd.Series({"ASK":ask_post_com, "SHOW":show_post_com})
avg_comment

* We can make a bar chart or a pie chart.
* A bar chart fits when we have categories with absolute values.
* A pie chart fits when the data are percentages or proportions.
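
To illustrate the difference between absolute values and proportions, here is a small sketch with made-up post types. The `post_types` data is hypothetical and only shows how the same Series yields counts (bar chart input) or shares (pie chart input).

```python
import pandas as pd

# Hypothetical labels for eight posts
post_types = pd.Series(["ask", "show", "other", "other",
                        "ask", "other", "other", "other"])

# Absolute values per category: suited to a bar chart
counts = post_types.value_counts()

# Proportions that sum to 1: suited to a pie chart
shares = post_types.value_counts(normalize=True)

print(counts)
print(shares)
```

With matplotlib available, `counts.plot.bar()` and `shares.plot.pie()` would draw the two charts.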

In [None]:
avg_comment.plot.bar()
plt.show()

In [None]:
avg_comment.plot.barh(title="AVERAGE COMMENTS")
plt.show()

In [None]:
ask_post=ask_post.copy()
show_post=show_post.copy()
# Make independent copies so that adding columns later avoids pandas' SettingWithCopyWarning

# Now we check which hour gets the most comments

In [None]:
ask_post.head()

In [None]:
ask_post["created_at"].dt.hour

In [None]:
ask_post["hours"]=ask_post["created_at"].dt.hour
ask_post.head()

In [None]:
ask_post.groupby("hours")["num_comments"].mean().sort_values(ascending=False)
# 3 PM (hour 15, US time) is the best hour for Ask HN posts
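
The hour analysis can be sketched on a few made-up rows to make each step visible. The `posts` frame and its numbers are hypothetical; the column names follow the notebook.

```python
import pandas as pd

# Hypothetical posts: two at 15:xx, two at 09:xx, one at 21:xx
posts = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2016-01-01 15:10", "2016-01-02 15:45",
        "2016-01-03 09:00", "2016-01-04 09:30",
        "2016-01-05 21:00",
    ]),
    "num_comments": [30, 10, 4, 6, 12],
})

# Extract the hour of day, then average comments per hour
posts["hour"] = posts["created_at"].dt.hour
avg_by_hour = (
    posts.groupby("hour")["num_comments"]
    .mean()
    .sort_values(ascending=False)
)

# The first index label after sorting is the best hour
best_hour = avg_by_hour.index[0]
print(avg_by_hour)
print(best_hour)
```

Note that the mean is sensitive to hours with very few posts, so checking the post count per hour alongside the average is a reasonable extra step.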

In [None]:
show_post.head()

In [None]:
show_post["created_at"].dt.hour

In [None]:
show_post["hour"]=show_post["created_at"].dt.hour
show_post.head()

In [None]:
show_post.groupby("hour")["num_comments"].mean().sort_values(ascending=False)

In [None]:
df=pd.DataFrame({'ask':ask_post.groupby("hours")["num_comments"].mean(),"show":show_post.groupby("hour")["num_comments"].mean()})
df.sort_values(["ask"],inplace=True)
df

In [None]:
df.plot.barh()
plt.show()

In [None]:
print({"Ask":ask_post.shape,"Show":show_post.shape})

# Export the Ask HN posts to a CSV file

In [None]:
ask_post.to_csv("Askhn.csv",index=False)
# index=True (the default) writes the row index (0, 1, 2, ...) as an extra first column; index=False leaves it out
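
The effect of the `index` argument is easiest to see on a tiny frame. In this sketch the `sample` frame is made up, and `to_csv` is called without a path so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical two-row frame
sample = pd.DataFrame({"title": ["a", "b"], "num_comments": [3, 5]})

# Default: the row index becomes an unnamed first column
with_index = sample.to_csv()

# index=False: only the real columns are written
without_index = sample.to_csv(index=False)

print(with_index.splitlines())
print(without_index.splitlines())
```

Leaving the index out keeps the file from gaining an extra unnamed column when it is read back with `pd.read_csv`.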

# Making a data frame of average comments per hour

In [None]:
h_com=ask_post.groupby(["hours"])["num_comments"].mean().sort_values(ascending=False)
h_com

In [None]:
h_com_df=pd.DataFrame(h_com)
h_com_df

In [None]:
h_com_df.index.name=None


In [None]:
h_com_df

# Print the top five hours

In [None]:
for index,value in h_com_df.head().iterrows():
    print(index)

In [None]:
for index,value in h_com_df.head().iterrows():
    print(value)

In [None]:
for index,value in h_com_df.head().iterrows():
    print(index,value)

In [None]:
for index,value in h_com_df.head().iterrows():
    print(index,value.values)

In [None]:
for index,value in h_com_df.head().iterrows():
    print(index,value.values[0])

In [None]:
for index,value in h_com_df.head().iterrows():
    hours=index
    comment=value.values[0]
    print(f"Hour {hours}: {comment:.2f} average comments per post")
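
As an alternative sketch, the same report can be built straight from a Series with `.items()`, without converting to a DataFrame first. The values in `avg_by_hour` below are made up for illustration and are not results from the data set.

```python
import pandas as pd

# Hypothetical averages, indexed by hour of day
avg_by_hour = pd.Series({15: 30.5, 2: 20.25, 20: 18.0}, name="num_comments")

# .items() yields (index label, value) pairs for a Series
lines = [
    f"{int(hour):02d}:00 -> {avg:.2f} average comments per post"
    for hour, avg in avg_by_hour.items()
]

for line in lines:
    print(line)
```

Here `:02d` pads the hour to two digits and `:.2f` rounds the average to two decimal places.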

In [None]:
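# The "," format specifier adds thousands separators
number=12783333338
f"My number is {number:,}"

In [None]:
# ",.2f" adds thousands separators and rounds to two decimal places
number=12783333338.9209
f"My number is {number:,.2f}"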

In [None]:
import pandas as pd

# Create a one-row data frame
df = pd.DataFrame([['Animal', 'Baby', 'Cat', 'Dog', 'Elephant', 'Frog', 'Gragor']])

# df.iterrows() yields (index, row) pairs;
# next(...)[1] takes the row (a Series) from the first pair
itr = next(df.iterrows())[1]
itr
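
To make the shape of each `iterrows()` item explicit, here is a minimal sketch that unpacks the pair. The `animals` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame with named columns
animals = pd.DataFrame({"name": ["Cat", "Dog"], "legs": [4, 4]})

# Each item from iterrows() is a (index label, row as Series) pair
index, row = next(animals.iterrows())

print(index)
print(type(row).__name__)
print(row["name"])
```

Because every row is packed into a single Series, mixed column types are upcast to a common dtype, so `iterrows()` is best kept for small frames and printing rather than heavy computation.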
