Running the parser with one alternative keyword provides much more accurate results than a list of keywords. #25

Closed
dkennedy778 opened this issue Dec 22, 2017 · 2 comments

dkennedy778 commented Dec 22, 2017

The first time I ran the parser, I used the following call:

memes = findMemes("meme",Meme_secondary_keywords,i)

Here "findMemes" is simply the parser wrapped in a method, Meme_secondary_keywords is a list of meme-related terms such as "greg", "trump", and "kobe", and i is superfluous.

The second time I ran the parser, I switched to this method:

for word in variety_keywords:
    Secondarykeyword = []
    Secondarykeyword.append(word)
    memes = findMemes(search_keyword, Secondarykeyword, i)
    i = i + 1

In the above, i just acts as a tag that is added onto each unique results folder.
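The per-keyword pattern above can be sketched in a self-contained form. This is only an illustration: `find_memes` here is a hypothetical stand-in that records its arguments instead of running the real parser, and `run_per_keyword` is a helper name invented for this example.

```python
from typing import Dict, List


def find_memes(search_keyword: str, secondary_keywords: List[str], tag: int) -> Dict:
    """Hypothetical stand-in for the parser wrapper; it only records the call."""
    return {
        "keyword": search_keyword,
        "secondary": list(secondary_keywords),
        "folder_tag": tag,
    }


def run_per_keyword(search_keyword: str, variety_keywords: List[str]) -> List[Dict]:
    """Call the wrapper once per secondary keyword, passing a one-element list."""
    runs = []
    # enumerate supplies the counter that tags each results folder
    for tag, word in enumerate(variety_keywords):
        runs.append(find_memes(search_keyword, [word], tag))
    return runs


runs = run_per_keyword("meme", ["greg", "trump", "kobe"])
```

Each call receives exactly one secondary keyword, so every keyword starts from its own first page of results rather than following on from the searches that came before it.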

The results from the second method contained FAR more "meme-like" content. Reviewing the results from the first method, I noticed that after the first 50 or so results I was getting very few memes.

For example, the word "Trump" is the 11th element of the list, so it will run after 1,100 searches have been done. The first method did not return any Trump memes; all of the results were just generic pictures of Trump. The second method returned over 95 pictures which I would classify as memes.

This is likely an idiosyncrasy of Google's image search algorithm. I would hypothesize that as you get further from the original results page, the scope of the results widens dramatically. I do not know if this problem can be fixed on your end, but I think it would be useful for end users to be aware of the second method. If I had run my parse with the first method, my results would have been unusable, but the second method gave me very good data.

See the master class in my memeClassifier for an example of this behavior https://github.com/dkennedy778/memeClassifier

Searching for memes with the for loop provides very good results, but searching with a single call and the full keyword list is practically useless.

dkennedy778 added a commit to dkennedy778/google-images-download that referenced this issue Dec 24, 2017
Made some minor changes to make it easier to run the code as a loop. 

Running the parser from a main loop provides much better results when using alternative keywords. If you are parsing for a large set of alternative keywords, I recommend you structure your search as detailed in the issue below. See the master.py file in my memeClassifier project for the entire implementation.

hardikvasa#25
Owner

hardikvasa commented Feb 18, 2018

Hi @dkennedy778 , thanks for opening up this issue.

Fast forwarding 30+ commits after you forked this code, the secondary keywords were removed and have now been added back to this repo. They are now called suffix keywords because they are appended to the main keyword.

I use loops to add each suffix (secondary) keyword to the main keyword in the following way:

for every main keyword    
    for every secondary keyword
        search for main keyword + secondary keyword

If this is not what you meant, please feel free to correct me :)
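For readers who want something concrete, the nested loop above might look roughly like the following in Python. This is a sketch under assumptions, not the repository's actual code: `build_queries` is a hypothetical helper, and joining the two keywords with a single space is assumed.

```python
from itertools import product
from typing import Iterable, List


def build_queries(main_keywords: Iterable[str], suffix_keywords: Iterable[str]) -> List[str]:
    """Return one search string per (main, suffix) pair, main keyword first."""
    # product iterates the suffix keywords in the inner position,
    # matching "for every main keyword / for every secondary keyword"
    return [
        f"{main} {suffix}"
        for main, suffix in product(main_keywords, suffix_keywords)
    ]


queries = build_queries(["meme"], ["greg", "trump", "kobe"])
```

Each resulting string would then be searched on its own, so every main and suffix combination gets a fresh query.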

@dkennedy778
Author

You've got it exactly, hardikvasa. Thanks for adding this in!
