Notebook from: http://fritzsluzala.pythonanywhere.com/topgit

In [None]:
#import sys
#!conda install --yes --prefix {sys.prefix} wordcloud

# Top Starred Open Source Projects on GitHub

Data Source: https://data.world/chasewillden/top-starred-open-source-projects-on-github

For data manipulation, I used pandas. I find it very easy to use, and it saves me a lot of time on parsing and cleaning data.

In [None]:
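import pandas as pd

# Load the dataset and drop the column that is not used in this analysis
df = pd.read_csv("in/TopStaredRepositories.csv", delimiter=",")
df = df.drop("Last Update Date", axis=1)
# Star counts carry a "k" suffix; strip it and keep the values as floats, in thousands
df["Number of Stars"] = df["Number of Stars"].str.replace("k", "")
df["Number of Stars"] = df["Number of Stars"].apply(float)
# Replace missing values with empty strings
df = df.fillna("")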
df.head()
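
The cell above keeps the star counts in thousands: a label with a "k" suffix simply loses the suffix. If absolute counts are needed, a small helper could scale the value instead. This is only a minimal sketch: `parse_star_count` is a hypothetical name, and it assumes every label is either a plain number or a number followed by "k".

```python
def parse_star_count(text):
    """Convert a star label such as '290k' or '812' to an absolute float count."""
    cleaned = text.strip().lower()
    if cleaned.endswith("k"):
        # "k" means thousands, so scale the numeric part by 1000
        return float(cleaned[:-1]) * 1000
    return float(cleaned)


print(parse_star_count("290k"))  # 290000.0
print(parse_star_count("812"))   # 812.0
```

With pandas it could be applied as `df["Number of Stars"].apply(parse_star_count)` on the raw string column, in which case the axis limits of any plot would need to change from thousands to absolute values.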

In [None]:
from matplotlib import pyplot as plt
import seaborn as sbn
%matplotlib inline

sbn.lmplot(x="index", y="Number of Stars", data=df.reset_index(),
           scatter_kws={'color': 'red'}, fit_reg=False,
           height=10, aspect=2)  # `size` was renamed to `height` in seaborn 0.9
plt.xlim(-10, 1000)
plt.ylim(0, 300)
print()  # Just to keep the notebook from echoing the return value of plt.ylim(0, 300). :D

In [None]:
tags = []
for line in df["Tags"]:
    if line:  # skip repositories without tags (missing values were replaced by "")
        tags += line.replace(" ","").split(",")
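
The loop above flattens the comma-separated "Tags" column into a single list. Here is a minimal stdlib sketch of the same idea as a reusable function. `flatten_tags` and the sample rows are hypothetical and not taken from the dataset.

```python
def flatten_tags(rows):
    """Split comma-separated tag strings into one flat list of tags."""
    flat = []
    for row in rows:
        if row:  # skip empty strings (rows without tags)
            # remove spaces, then split on commas, as the loop above does
            flat += row.replace(" ", "").split(",")
    return flat


# Hypothetical sample rows, not taken from the dataset
sample_rows = ["python, web", "", "web,framework"]
print(flatten_tags(sample_rows))  # ['python', 'web', 'web', 'framework']
```

Note that removing every space also glues multi-word tags together, which keeps each tag as one token when the list is later joined with spaces.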

Then I counted the tags to check whether it was worth building a wordcloud.

In [None]:
from collections import Counter

Counter(tags).most_common(10)

Judging by the result above, it seemed reasonable to build a wordcloud from the tags. Here is the result:

In [None]:
from wordcloud import WordCloud, random_color_func

wordcloud = WordCloud(background_color='black',
                     width=1000,
                     height=500,
                     ).generate(' '.join(tags))
wordcloud.recolor(color_func=random_color_func)
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

After seeing the result above, I got curious about which users had the most projects on the most-starred list.

In [None]:
Counter(df['Username']).most_common(5)

The wordcloud was:

In [None]:
wordcloud = WordCloud(background_color='black',
                     width=1000,
                     height=500,
                     ).generate(' '.join(df['Username']))
wordcloud.recolor(color_func=random_color_func)
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

As a last visualization, I wanted to see which programming languages were the most used in those projects.

In [None]:
# Count the non-empty language names, most common first
languages = Counter([lang for lang in df["Language"] if lang]).most_common()
languages[:10]

BUUUUUUUUT the resultant wordcloud (below) was awful! D:

In [None]:
wordcloud = WordCloud(background_color='black',
                     width=1200,
                     height=800,
                     ).generate(' '.join([lang for lang in df["Language"] if lang]))
wordcloud.recolor(color_func=random_color_func)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
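
One possible reason for the poor result, stated here as an assumption rather than something the notebook checked: `generate()` tokenizes the joined text, so language names containing symbols (for example "C++") may not survive as written. The wordcloud library also offers `generate_from_frequencies`, which takes a mapping of word to count and skips tokenization. A minimal sketch follows; the helper names and the sample list are hypothetical.

```python
from collections import Counter


def language_frequencies(languages):
    """Count each non-empty language name, keeping the name intact."""
    return dict(Counter(name for name in languages if name))


def draw_language_cloud(frequencies):
    """Render a word cloud from a {name: count} mapping (needs wordcloud installed)."""
    from wordcloud import WordCloud  # imported lazily so the counting works without it

    cloud = WordCloud(background_color="black", width=1200, height=800)
    return cloud.generate_from_frequencies(frequencies)


# Hypothetical sample, not taken from the dataset
sample = ["C++", "Objective-C", "", "C++", "Go"]
print(language_frequencies(sample))  # {'C++': 2, 'Objective-C': 1, 'Go': 1}
```

In the notebook this could be called as `draw_language_cloud(language_frequencies(df["Language"]))` and shown with `plt.imshow`, like the other clouds.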

Sooo, I decided to present it in another way. I saved the programming language counts to a CSV file and used the open source tool RAWGraphs to try other visualizations, and I chose "Circle Packing" as my chart. Check out RAWGraphs at: http://rawgraphs.io/
<br/><br/>
The result of the Circle Packing chart is the one below. Much better! :D

In [None]:
with open("languages_count.csv","w") as languages_file:
    for language in languages:
        languages_file.write("{}\n".format(";".join(str(i) for i in language)))
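
The export above joins each (language, count) pair by hand. A minimal sketch of the same semicolon-separated output using the standard `csv` module is shown below. `write_counts` is a hypothetical helper, and the sample counts are made up; it writes to an in-memory buffer so it can run without touching the disk.

```python
import csv
import io


def write_counts(rows, handle):
    """Write (name, count) pairs as semicolon-separated lines."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical counts, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_counts([("Python", 3), ("Go", 1)], buffer)
print(buffer.getvalue())
```

The main difference from the manual join is that `csv.writer` quotes any field that itself contains a semicolon, so such a name would not break the columns. To write to disk, pass a file opened with `open("languages_count.csv", "w", newline="")` as `handle`.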