# How to get more views on Hacker News

In this project we explore data from Hacker News. The data is a representative sample of about 20,000 records drawn from a total of roughly 300,000. We analyze which type of post receives more comments on average: posts that ask a question ('Ask HN') or posts that showcase something ('Show HN').

In [1]:
# Read the CSV file into the list hn
from csv import reader

with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

# Keep the header row in a separate variable (a one-element list of lists)
headers = hn[:1]
# Remove the header row from the data
hn = hn[1:]


In [2]:
# Take a look at the header row
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [3]:
# Take a look at the first 5 rows
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Cleaning our data

Next we keep only the posts whose titles start with 'Ask HN' or 'Show HN'. The list contains every kind of post, and the analysis focuses on these two types only.

In [4]:
# Split the posts into Ask HN, Show HN, and everything else
ask_posts=[]
show_posts=[]
other_posts=[]

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
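
To see the matching rule in isolation, here is a minimal sketch that applies the same case-insensitive prefix test to a few made-up titles. The function name `classify_title` and the sample titles are illustrative assumptions; they are not part of the dataset or of the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' using a case-insensitive prefix test."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, for illustration only
sample_titles = [
    'Ask HN: What are you reading?',
    'SHOW HN: My weekend project',
    'Why I ask HN for advice',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Only the start of the title is tested, so a title that merely mentions 'ask HN' somewhere in the middle falls into the "other" group.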




We print how many posts of each type we found.

In [5]:
print("Number of ask posts: {}".format(len(ask_posts)))
print("Number of show posts: {0}".format(len(show_posts)))
print("Number of other posts: {0}".format(len(other_posts)))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


Let's look at the first 5 'Ask HN' posts.

In [6]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Now let's look at the first 5 'Show HN' posts.

In [7]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


## Ask HN vs Show HN

Now we determine which type of post receives more comments on average.

### Ask HN comments

In [8]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Total posts to Ask HN: {}".format(len(ask_posts)))
print("Total comments to Ask HN: {}".format(total_ask_comments))
print("Average comments to Ask HN: {:.2f}".format(avg_ask_comments))

Total posts to Ask HN: 1744
Total comments to Ask HN: 24483
Average comments to Ask HN: 14.04


### Show HN comments

In [9]:
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Total posts to Show HN: {}".format(len(show_posts)))
print("Total comments to Show HN: {}".format(total_show_comments))
print("Average comments to Show HN: {:.2f}".format(avg_show_comments))

Total posts to Show HN: 1162
Total comments to Show HN: 11988
Average comments to Show HN: 10.32


### Winner by average number of comments

With both averages computed, we can say that 'Ask HN' posts receive more comments on average: about 14.04 comments per post, against about 10.32 for 'Show HN' posts. 'Ask HN' posts also gathered far more comments in total, 24,483 against 11,988.
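
The two averages can be re-derived directly from the totals printed above. The short check below only repeats that arithmetic; the four numbers are copied from the cell outputs, and the variable names are chosen for this example.

```python
# Totals copied from the outputs of the two cells above
ask_total_comments, ask_post_count = 24483, 1744
show_total_comments, show_post_count = 11988, 1162

ask_avg = ask_total_comments / ask_post_count
show_avg = show_total_comments / show_post_count

print("Ask HN:  {:.2f} comments per post".format(ask_avg))
print("Show HN: {:.2f} comments per post".format(show_avg))
print("Ask HN / Show HN ratio: {:.2f}".format(ask_avg / show_avg))
```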

### Average comments per hour for the winner, Ask HN

We compute the average number of comments per post for each hour of the day, to find out which hour is the best time to post in order to receive more comments.

In [20]:
# Compute the average number of comments for each hour of the day
import datetime as dt

# Pair each Ask HN post's creation time with its comment count
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

counts_by_hour = {}    # number of posts created in each hour
comments_by_hour = {}  # total comments on posts created in each hour

for created_at, num_comments in result_list:
    date_comment = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    int_hr = date_comment.strftime("%H")  # zero-padded hour, e.g. '09'
    if int_hr in counts_by_hour:
        counts_by_hour[int_hr] += 1
        comments_by_hour[int_hr] += num_comments
    else:
        counts_by_hour[int_hr] = 1
        comments_by_hour[int_hr] = num_comments

# Average comments per post for each hour
avg_by_hour = []
for hour_post in counts_by_hour:
    avg_by_hour.append([hour_post, comments_by_hour[hour_post] / counts_by_hour[hour_post]])

print(avg_by_hour)

[['06', 9.022727272727273], ['13', 14.741176470588234], ['15', 38.5948275862069], ['21', 16.009174311926607], ['02', 23.810344827586206], ['10', 13.440677966101696], ['05', 10.08695652173913], ['04', 7.170212765957447], ['18', 13.20183486238532], ['16', 16.796296296296298], ['03', 7.796296296296297], ['08', 10.25], ['23', 7.985294117647059], ['12', 9.41095890410959], ['14', 13.233644859813085], ['22', 6.746478873239437], ['20', 21.525], ['01', 11.383333333333333], ['09', 5.5777777777777775], ['00', 8.127272727272727], ['07', 7.852941176470588], ['17', 11.46], ['11', 11.051724137931034], ['19', 10.8]]
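
As a small illustration of the parsing step, the sketch below applies the same format string to one `created_at` value, taken from the first Ask HN post printed earlier, and extracts the zero-padded hour that serves as the dictionary key.

```python
import datetime as dt

# created_at value of the first Ask HN post printed earlier
created_at = '8/16/2016 9:55'

# Parse month/day/year and 24-hour time, then keep only the hour
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
hour_key = parsed.strftime("%H")  # zero-padded hour as a string

print(parsed)
print(hour_key)
```

Because the hour is kept as a zero-padded string, the keys compare and sort consistently as text ('02' comes before '15').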


In [27]:
# Swap each [hour, average] pair to [average, hour] so the list sorts by average
swap_avg_by_hour = []
for hour, average in avg_by_hour:
    swap_avg_by_hour.append([average, hour])

print(swap_avg_by_hour)

# Sort from highest to lowest average
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

    

[[9.022727272727273, '06'], [14.741176470588234, '13'], [38.5948275862069, '15'], [16.009174311926607, '21'], [23.810344827586206, '02'], [13.440677966101696, '10'], [10.08695652173913, '05'], [7.170212765957447, '04'], [13.20183486238532, '18'], [16.796296296296298, '16'], [7.796296296296297, '03'], [10.25, '08'], [7.985294117647059, '23'], [9.41095890410959, '12'], [13.233644859813085, '14'], [6.746478873239437, '22'], [21.525, '20'], [11.383333333333333, '01'], [5.5777777777777775, '09'], [8.127272727272727, '00'], [7.852941176470588, '07'], [11.46, '17'], [11.051724137931034, '11'], [10.8, '19']]
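
Swapping the columns is one way to sort by average. An alternative sketch, assuming the same `[hour, average]` layout as `avg_by_hour`, passes a `key` function to `sorted` so the original column order is kept. The three rows are copied from the output above to keep the example self-contained, and the names `avg_by_hour_sample` and `sorted_by_avg` are chosen for this example.

```python
# A few [hour, average] rows copied from the avg_by_hour output above
avg_by_hour_sample = [
    ['06', 9.022727272727273],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the average (second element), highest first
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, average in sorted_by_avg:
    print("{}:00 {:.2f}".format(hour, average))
```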


These are the 5 hours in which Ask HN posts received the most comments on average.

In [28]:

print("\nTop 5 hours for Ask HN comments (24-hour clock: average comments per post)\n")

for hour_avg in sorted_swap[:5]:
    hour_day = dt.datetime.strptime(hour_avg[1], "%H")
    hour_day = dt.datetime.strftime(hour_day, "%H:00")
    average_post = float(hour_avg[0])
    print("{0}: {1:.2f} average comments per post".format(hour_day, average_post))



Top 5 hours for Ask HN comments (24-hour clock: average comments per post)

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
