WordClouds of comments and friends #21

Merged
merged 6 commits into from Jul 13, 2018

6 participants
@parimatrix
Collaborator

parimatrix commented Jun 28, 2018

Implements two features so far, each generating a wordcloud.
The first shows the words the user uses most often in comments. The second shows the friends the user tags most often. Please have a look at the images added in the images section.

Closes #5

@kaustubhhiware

kaustubhhiware requested changes Jun 28, 2018 edited

Quite a thankful person.
Yeah, so for the first wordcloud itself, I think it would be better to account for posts AND comments.
The second wordcloud seems fine.

Could you change the colour scheme?
Seems too dark. The alice example might be better.

with open(fname) as f:
    base_data = json.load(f)
data = base_data["comments"]

@techytushar

techytushar Jun 28, 2018

Collaborator

It would be better to check whether the comments file contains any data (i.e. whether the user has any comments) and to display a message, instead of an error, if the user has no comments.
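
A minimal sketch of such a check, assuming the export keeps the comments under a top-level "comments" key as in the snippet under review. The helper names `get_comments` and `load_comments` and the message text are hypothetical, not part of the PR:

```python
import json


def get_comments(base_data):
    """Return the list of comment entries, or an empty list if there are none."""
    # .get() avoids a KeyError when the "comments" key is missing entirely.
    return base_data.get("comments") or []


def load_comments(fname):
    """Read the comments file and report politely when it holds no comments."""
    with open(fname) as f:
        base_data = json.load(f)
    comments = get_comments(base_data)
    if not comments:
        # A plain message instead of a traceback for users with no comments.
        print("No comments found in", fname)
    return comments
```

The caller can then skip the wordcloud step whenever the returned list is empty.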

@parimatrix

parimatrix Jun 28, 2018

Collaborator

True. Will do that.

@parimatrix

Collaborator

parimatrix commented Jun 28, 2018

Haha. My fb comments mostly include thanks to birthday posts :P
Will incorporate the changes soon.

@xprilion

Please modify to cover these points as mentioned in the issue -

  • What hashtags do you frequently use?
  • What language do you often use?

Also, add the wordcloud requirement in the requirements.txt file.

You might want to skip 2-3 letter words while making the wordcloud and also words which are derived from the same word, for example "Thank" and "Thanks". For my wordcloud, I would prefer seeing more unique and unrelated words rather than different forms of the same word over and over again.
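
One way the short-word filter could look, as a rough sketch. The cut-off of 4 letters and the name `keep_long_words` are assumptions for illustration, and the pattern only handles ASCII letters:

```python
import re

MIN_WORD_LENGTH = 4  # assumption: drop words of 3 letters or fewer


def keep_long_words(text, min_length=MIN_WORD_LENGTH):
    """Return the words in `text` that are at least `min_length` letters long."""
    # Only ASCII letters and apostrophes are treated as word characters here.
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if len(w) >= min_length]
```

For instance, `keep_long_words("Thank you so much for the wishes")` keeps only "Thank", "much" and "wishes".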

@kaustubhhiware

Owner

kaustubhhiware commented Jun 28, 2018

also words which are derived from the same word, for example "Thank" and "Thanks"

I'm not sure if this is needed, but we could use word2vec's word similarity in this case.

Say at some point we have a populated list. When a new word is encountered, we check whether its similarity score with any existing word in the list is above a certain threshold (to be mutually agreed upon); if so, the two words are treated as the same word. We maintain one dictionary for keeping count of words, and another for the similar words encountered so far.
This way we can avoid similar words being counted as distinct.
Again, this might take some time, so it is not mandatory.
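
The two-dictionary idea above can be sketched as follows. Note the assumptions: the similarity function here is a character-overlap stand-in from the standard library, not word2vec, and the 0.8 threshold and all names are illustrative. A real word2vec similarity could be passed in through the `similarity` parameter instead.

```python
from collections import Counter
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT word2vec, just character overlap."""
    return SequenceMatcher(None, a, b).ratio()


def count_with_merging(words, similarity=string_similarity, threshold=0.8):
    """Count words, folding each new word into the first similar known word."""
    counts = Counter()   # canonical word -> occurrences
    merged_into = {}     # every word seen so far -> its canonical word
    for word in words:
        if word not in merged_into:
            # Compare against canonical words only; the first match wins.
            match = next(
                (known for known in counts if similarity(word, known) >= threshold),
                None,
            )
            merged_into[word] = match if match is not None else word
        counts[merged_into[word]] += 1
    return counts, merged_into
```

With this stand-in, "thanks" is folded into an earlier "thank", while "happy" stays separate. The result depends on the order in which words arrive, which is one reason the threshold would need tuning.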

@kaustubhhiware

Owner

kaustubhhiware commented Jun 28, 2018

Upon closer inspection, it seems the second wordcloud (tagged friends) might be incorrect. It contains words like FB, Robot, Mummy, Yes, Unlucky, Machine, which are unlikely to be names or surnames.

You haven't extracted your friends' names at any point, which is why a lot of ordinary English words are being visualised. Have a look; the second wordcloud needs to be rectified.
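
A rough sketch of restricting the count to known friends. It assumes the parsed friends file has the shape {"friends": [{"name": "..."}]}, which is a guess about the export layout, and the function names are hypothetical:

```python
from collections import Counter


def friend_names(friends_data):
    """Collect full names from parsed friends data (assumed layout)."""
    # Assumption: {"friends": [{"name": "..."}, ...]}
    entries = friends_data.get("friends", [])
    return {entry["name"] for entry in entries if "name" in entry}


def count_tagged_friends(comment_texts, names):
    """Count how many comments mention each known friend's full name."""
    counts = Counter()
    for text in comment_texts:
        for name in names:
            # Plain substring match; words that are not friend names are ignored.
            if name in text:
                counts[name] += 1
    return counts
```

Only names present in the friends list can reach the wordcloud, so stray words such as "Robot" or "Machine" are dropped.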

@parimatrix

Collaborator

parimatrix commented Jun 28, 2018

@kaustubhhiware Yes, I totally forgot to restrict it to friends only. That is done now; the plot contains only friends' names.

@xprilion Yes, I have not implemented the hashtag and language features yet. I was working on the wordclouds.

Lastly, regarding repeated words, which is a classic NLP problem: it's true, WordCloud repeats nearly identical words too many times. I found two partial solutions for it.

  1. Passing collocations=False when calling WordCloud.
  2. Using the NLTK stemmer to reduce similar words like thank, thanks, thanking, etc. to one.

This produced better results. Have added the new plots.

@techytushar

Collaborator

techytushar commented Jun 28, 2018

nice work @parimatrix the wordcloud looks very nice and more accurate now 👍

@xprilion

Collaborator

xprilion commented Jun 28, 2018

I see that you've tried the stemming on the words, which is what I wanted you to do! That's the established way of handling nearby words in NLP :)

But apparently you've missed converting them back to their original form after the stemming. So the wordcloud is displaying the stemmed words. You should consider converting back the stemmed word into any of the regular forms of the word, preferably the singular present tense. An easy solution would be to maintain a dict of all words that stem to the same word, with the stemmed word being the dict key.

As @kaustubhhiware says, this is not mandatory. It is a subtle difference.
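
The dict keyed by stem could be sketched like this. The `naive_stem` function is a toy stand-in so the example runs on its own; in the PR's setting the `stem` method of NLTK's `PorterStemmer` could be passed as the `stem` argument instead. Choosing the shortest original form as the label is an assumption, used here as a rough proxy for the base form.

```python
from collections import Counter, defaultdict


def naive_stem(word):
    """Toy stand-in for a real stemmer: lowercases and strips a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_by_stem(words, stem=naive_stem):
    """Group words by stem, then label each group with its shortest original form."""
    forms = defaultdict(Counter)  # stem -> Counter of original spellings
    for word in words:
        forms[stem(word)][word.lower()] += 1
    display_counts = {}
    for variants in forms.values():
        # Shortest form first; ties are broken alphabetically.
        label = min(variants, key=lambda w: (len(w), w))
        display_counts[label] = sum(variants.values())
    return display_counts
```

The resulting dict maps a readable word to the combined count of all its forms, so the wordcloud shows "thank" rather than a truncated stem.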

As for the 2nd wordcloud, there seems to be a weird issue where only the surnames of some people appear. This may be because the wordcloud generation library takes a space-separated list of words. I would suggest sending in only the first names of the people. Watch out for names that begin with "Md" or "Kumar"; there you would want the second name.
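
A small sketch of that first-name rule. The prefix set and the name `display_name` are assumptions for illustration, and the list of prefixes would likely need extending:

```python
SKIPPED_PREFIXES = {"md", "md.", "kumar"}  # assumption: extend as needed


def display_name(full_name):
    """Pick the single word that should represent a person in the wordcloud."""
    parts = full_name.split()
    if not parts:
        return ""
    # Use the second word when the first is a common prefix rather than a name.
    if parts[0].lower() in SKIPPED_PREFIXES and len(parts) > 1:
        return parts[1]
    return parts[0]
```

Feeding one word per person into the wordcloud avoids the library splitting full names on spaces.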

flist = []
fname = loc+'/friends/friends.json'
if not os.path.isfile(fname):
    print("The file your_posts.json is not present at the entered location.")

@kaustubhhiware

kaustubhhiware Jun 29, 2018

Owner

Please rectify the error message.
Should say "friends.json" instead of "your_posts.json".
Trivial issue, but let's do it right.

@parimatrix

parimatrix Jun 29, 2018

Collaborator

Oh yes! Will change it.

@kaustubhhiware

Owner

kaustubhhiware commented Jun 29, 2018

For first wordcloud, there's a big "Thank" in the middle, and yet another "Thank" in the top right corner above "Happy". Seems something needs to be fixed 😅

The second wordcloud looks good. No suggestions on that. Due to this color scheme, I'm able to see even the smallest words. Good choice 👍
At first I thought you were filtering the top 10 tagged friends, but the code says you're plotting all the friends. This means that, in reality, we don't tag a lot of people in comments. If you think this is an interesting enough observation, you can add it to the Observations section in the README.

@techytushar

Collaborator

techytushar commented Jun 29, 2018

Hey @parimatrix, just a tip: don't push too many small commits; make a new commit only when there is a meaningful amount of change since the last one. To add your changes to the previous commit you can use git commit --amend, or you can squash several commits into one. Also try to write descriptive commit messages, so that anyone reading the log can understand what was done in each commit. All this keeps the repository log clean and understandable, which helps as the project grows. 🙂
You can refer to this: https://chris.beams.io/posts/git-commit/

@parimatrix

Collaborator

parimatrix commented Jun 29, 2018

@techytushar Yeah, I understand. Sorry for this. Will take care in future. 😞

@kaustubhhiware

Owner

kaustubhhiware commented Jun 30, 2018

@parimatrix just let us know when you're ready for a review. Since this might take some time, I'm adding the in-progress label. Feel free to remove it when you're done.

The few things left to complete before a review are:

  • Hashtags
  • Language
  • Remove the worcloud_comments_2.0.png
  • Add images in the README.

Again, this is a big issue, so take your time. I'm not trying to rush you. These are just some practices I wish to set up for future PRs.

@farhaanbukhsh

Amazing work! I might be nitpicking, but here are some good points that you can incorporate.

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
ps = PorterStemmer()

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

If you are using something globally and it's a constant, the convention is to name it in all caps.

ps = PorterStemmer()
def wordcloud():

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

I would suggest attaching a docstring to the function.

base_data = json.load(f)
final_text = ""
final_comments = ""

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

I would suggest using None here instead; it makes your code more readable.

for ele in data:
    if 'data' in ele:
        ctext = ele["data"][0]["comment"]["comment"]

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

As far as I can see, this line can be moved up. You are keeping ctext constant, right? So why should it be redeclared every time in the loop?

@parimatrix

parimatrix Jul 4, 2018

Collaborator

Actually, ctext is assigned again and again in the loop to the particular key's value for each object. Sorry, I could not understand what you were referring to.

@farhaanbukhsh

farhaanbukhsh Jul 12, 2018

Collaborator

This looks fine to me 👍 I was thinking the ctext value was used only once.

    b = detect(ctext)
    #if b not in languages:
    languages.append(b)
except:

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

Any exception? Should we narrow it down to something specific?
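
A sketch of what narrowing the handler could look like. To keep it runnable on its own, the detector and its exception type are hypothetical stand-ins (`fake_detect`, `DetectionError`). If the PR uses the langdetect package, the type to catch would presumably be its `LangDetectException`, but that is an assumption about the code under review.

```python
class DetectionError(Exception):
    """Hypothetical stand-in for the detector's own exception type."""


def fake_detect(text):
    """Toy detector: fails on text without letters, otherwise says 'en'."""
    if not any(ch.isalpha() for ch in text):
        raise DetectionError("no features in text")
    return "en"


def detect_languages(texts, detect=fake_detect, error=DetectionError):
    """Detect a language per text, skipping only the texts the detector rejects."""
    languages = []
    for text in texts:
        try:
            languages.append(detect(text))
        except error:
            # Only the expected failure is swallowed; real bugs still surface.
            continue
    return languages
```

Unlike a bare `except:`, this does not hide unrelated problems such as a typo in a variable name or a keyboard interrupt.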

print("Your Most Common Language: ")
if max(languages,key=languages.count) =='en':
    print('English')

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

Why the bias towards English? I feel we should just print the language 😄
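
Printing the detected code directly could be sketched as below. The helper name `most_common_language` is hypothetical. `Counter` tallies the list in one pass, whereas `max(languages, key=languages.count)` rescans the list for every element and raises `ValueError` on an empty list.

```python
from collections import Counter


def most_common_language(languages):
    """Return the most frequent language code, or None for an empty list."""
    if not languages:
        return None
    return Counter(languages).most_common(1)[0][0]
```

The caller can then print whatever code comes back, with no special case for 'en'.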

print(max(languages,key=languages.count))
if final_text != "":
    mask = np.array(Image.open("images/mymask.png"))

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

Again, this is a constant file location; move it to the top.

else:
    print("No Comments and Posts Text Found")
if __name__ == '__main__':

@farhaanbukhsh

farhaanbukhsh Jul 1, 2018

Collaborator

Have an empty line at the end of the file.

@parimatrix

Collaborator

parimatrix commented Jul 1, 2018

@farhaanbukhsh THANKS so very much. I did not know at all about some of the points. Really glad to know these.
Also, @kaustubhhiware, I have been working on this today and have made some progress. However, I will not be able to work on this for the next two days; I will resume soon after that. Would that be fine, or is there a deadline I would be missing?
Thanks!

@kaustubhhiware

Owner

kaustubhhiware commented Jul 2, 2018

Since you've put in some work, it's alright.
Just make sure you're in touch regardless of whether you can work or not.

@parimatrix

Collaborator

parimatrix commented Jul 4, 2018

Hi.
I have implemented most of the points mentioned by @farhaanbukhsh. Where I had a doubt, I have replied inline.
@kaustubhhiware Out of the 4 points you mentioned, the last 3 are done. For hashtags, I am facing a bit of an issue. In the data, the symbol '#' appears many times inside links, image data, color codes, or maybe even emoticon data, so extracting hashtags seems difficult. If you could suggest something, that would be nice. Or maybe we could drop the hashtags, if that's fine with you.
Thanks

@kaustubhhiware

Owner

kaustubhhiware commented Jul 6, 2018

Hmm, ok. What kinds of '#' are you seeing?

Could you list some of them, one for each form you have mentioned?

Let's try figuring out whether hashtags can be worked on; if not, we'll drop them.

@parimatrix

Collaborator

parimatrix commented Jul 6, 2018

Okay, here are a few examples:

.../flash/swflash.cab#version=8,0,0,0...
.../photos?ref=ts#!/photo.php?fbid=92...
...param name="bgcolor" value="#ffffff"...
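
For what it's worth, one heuristic that would reject all three samples above is to accept a '#' only when it starts a whitespace-delimited token and is followed by a letter or underscore. This is only a sketch with a hypothetical helper name, and it would still be fooled by cases such as a colour code preceded by a space:

```python
import re

# '#' must be at the start of the text or preceded by whitespace,
# and the tag itself must begin with a letter or underscore.
HASHTAG_RE = re.compile(r"(?<!\S)#([A-Za-z_]\w*)")


def extract_hashtags(text):
    """Return lowercased hashtag names found in `text`, without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]
```

URL fragments and quoted attribute values fail the "preceded by whitespace" condition, which is what filters out the samples listed here.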

@xprilion

Collaborator

xprilion commented Jul 6, 2018

@parimatrix I am not sure which file you are using.

My your_posts.json and comments.json look clean enough to extract the hashtags used in comments and posts. I ran a quick '#' search in the files and found that in 9/10 cases it was a hashtag in my post or comment. None matched the samples you provided above.

@parimatrix

Collaborator

parimatrix commented Jul 7, 2018

I used the exact same files, your_posts.json and comments.json.
I hardly use hashtags, which explains why I have fewer valid ones, but I guess the invalid ones could show up in anyone's data.
We would have to handle these invalid ones; we can't ignore them.

@xprilion

Collaborator

xprilion commented Jul 7, 2018

Can you work something out by discarding the posts with invalid hashtags? Or is the volume too high? :o

@parimatrix

Collaborator

parimatrix commented Jul 7, 2018

In my case, 7 were invalid out of 8. My case could be an outlier though. :p

@techytushar

Collaborator

techytushar commented Jul 7, 2018

In my case I found 6 '#' characters, all in the your_posts file, and all of them were invalid: they were in URLs that I had included in my posts. One possibility is to accept only hashtags made of valid English words, but most hashtags contain some kind of abbreviation, so those would be treated as invalid. So it's quite complicated.

@parimatrix

Collaborator

parimatrix commented Jul 8, 2018

@techytushar Yeah, that's true. People rarely use proper English words. Another difficulty is that even when a hashtag is made of valid English words, there are no spaces between them, so looking the string up in a dictionary of words would not work either.
@kaustubhhiware So I think hashtags will have to be skipped, if possible.

@xprilion

Collaborator

xprilion commented Jul 8, 2018

everything else is fine by me..approving changes :)

@kaustubhhiware

Owner

kaustubhhiware commented Jul 9, 2018

Seeing the timespan of this PR (and accounting for your data), it seems it's best to leave hashtags out of this PR.
Let someone else handle hashtags, or else I think I have some pretty hashtag patterns as well.
Cool, approved from my side as well.
Letting other mentors have a look before merging this in.

@kaustubhhiware

Owner

kaustubhhiware commented Jul 10, 2018

@farhaanbukhsh have the changes you suggested been incorporated?
Need to merge this in.

@farhaanbukhsh

Collaborator

farhaanbukhsh commented Jul 12, 2018

Hey sorry for the delay I was a bit caught up 😄

@farhaanbukhsh

LGTM 👍 Good work 😄

@roopalJazz roopalJazz merged commit f8c16ce into kaustubhhiware:master Jul 13, 2018
