Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #1252: Add Scraper for questions of QuoraScraper #1255

Merged
merged 1 commit into from Jul 8, 2017

Conversation

vibhcool
Copy link
Member

@vibhcool vibhcool commented Jun 16, 2017

Fixes Issue: #1252

Test link : http://35.184.148.24/api/quoraprofilescraper?query=Vibhor-Verma-5
http://35.184.148.24/api/quoraprofilescraper?query=asd

I have:

  • There is a corresponding issue for this pull request.
  • Mentioned the Issue number in the pull request commit message Fixes #<number> commit message
  • There is only strictly only one commit per issue.

For the reviewers

I have:

  • Reviewed this pull request by an authorized contributor.
  • The reviewer is assigned to the pull request.

@vibhcool
Copy link
Member Author

vibhcool commented Jun 20, 2017

@Orbiter @sudheesh001 @daminisatya @jig08 @mariobehling @singhpratyush @kavithaenair @SKrPl @hemantjadon @Achint08 @sarishinohara @djmgit

I am stuck at a wierd problem, In this method getData(), scraped data is returned to here https://github.com/loklak/loklak_server/pull/1255/files#diff-37e4b654b188b27f598f203ceb234782R83 , after this method is returns it's control.
this method returns no data. but it collects the data.
and json output is generated before scraped data is collected
what shall I do?
EDIT: fixed this, open for review

@vibhcool vibhcool changed the title [WIP] Fixes #1252: Add Scraper for questions of QuoraScraper Fixes #1252: Add Scraper for questions of QuoraScraper Jun 24, 2017
@vibhcool vibhcool force-pushed the 1230 branch 2 times, most recently from b1ec82a to c8b0b58 Compare July 1, 2017 12:57
@codecov-io
Copy link

codecov-io commented Jul 2, 2017

Codecov Report

Merging #1255 into development will decrease coverage by 0.09%.
The diff coverage is 2.25%.

Impacted file tree graph

@@               Coverage Diff               @@
##             development   #1255     +/-   ##
===============================================
- Coverage           9.08%   8.99%   -0.1%     
+ Complexity           397     396      -1     
===============================================
  Files                200     200             
  Lines              17359   17506    +147     
  Branches            3249    3267     +18     
===============================================
- Hits                1577    1574      -3     
- Misses             15475   15623    +148     
- Partials             307     309      +2
Impacted Files Coverage Δ Complexity Δ
src/org/loklak/objects/Timeline2.java 0% <0%> (ø) 0 <0> (ø) ⬇️
src/org/loklak/api/search/QuoraProfileScraper.java 0% <0%> (ø) 0 <0> (ø) ⬇️
src/org/loklak/harvester/BaseScraper.java 0% <0%> (ø) 0 <0> (ø) ⬇️
src/org/loklak/api/search/ConsoleService.java 0% <0%> (ø) 0 <0> (ø) ⬇️
src/org/loklak/harvester/Post.java 30.55% <18.18%> (-22.08%) 2 <1> (ø)
src/org/json/JSONTokener.java 42.75% <0%> (-1.38%) 22% <0%> (-1%)
src/org/loklak/harvester/TwitterScraper.java 52.73% <0%> (-0.48%) 35% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12d7cda...39980af. Read the comment docs.

Copy link
Member

@daminisatya daminisatya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again the same issue. You are partially providing fixes to two or three issues together and creating multiple PRs. If u think each fix is different then it's the problem with you on dividing a huge chunk of problem into smaller issues.

@vibhcool
Copy link
Member Author

vibhcool commented Jul 4, 2017

@daminisatya please see PR description, :)

@vibhcool vibhcool force-pushed the 1230 branch 4 times, most recently from 083c479 to 3d4d4cf Compare July 4, 2017 10:55
@vibhcool
Copy link
Member Author

vibhcool commented Jul 4, 2017

@Achint08 @djmgit @hemantjadon @kavithaenair @SKrPl @singhpratyush @daminisatya : please review, I have rebased the PR

Copy link
Member

@singhpratyush singhpratyush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Please take a look.

Also, please update the deployment link.

@@ -37,12 +40,12 @@

public class QuoraProfileScraper extends BaseScraper {

private final long serialVersionUID = -3398701925784347310L;
private final long serialVersionUID = -3398701925784347312L;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to change the serialVersionUID?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done this to show that this class is not backward compatible.
https://stackoverflow.com/questions/285793/what-is-a-serialversionuid-and-why-should-i-use-it

dataThreads[i].join();
}
} catch(InterruptedException e) {
DAO.severe("Couldn't complete all threads");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to mention the scraper name and type variable here so that it is easy to debug.

@@ -266,14 +266,26 @@ private Timeline2 addPost(Post post) {
return this;
}

public void mergePost(Timeline2 list) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using Timeline2#putAll here?

Copy link
Member Author

@vibhcool vibhcool Jul 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skipped this method as still UserEntry is not in use. I will have to set up Post object for this.

@vibhcool vibhcool force-pushed the 1230 branch 3 times, most recently from 9749393 to b9f0811 Compare July 4, 2017 14:41
@vibhcool
Copy link
Member Author

vibhcool commented Jul 4, 2017

@singhpratyush , updated the test link, please see :)

Copy link
Member

@singhpratyush singhpratyush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search results that I see for http://35.187.215.147/api/quoraprofilescraper?query=ronaldo looks like this -

{
    "metadata": {
        "count": "11",
        "hits": 11
    },
    "peer_hash": "SfjbDL/f/9SH1ySjUSX8k+lZ2ocidIqVgxRpbwv1EA0=",
    "peer_hash_algorithm": "SHA-256",
    "session": {
        "identity": {
            "anonymous": true,
            "name": "10.12.1.1",
            "type": "host"
        }
    },
    "statuses": [
        {
            "bio": "",
            "feeds": {},
            "knows_about": [
                "Ronaldo",
                "Ronaldo",
                "Ronaldo",
                "Ronaldo",
                "Ronaldo"
            ],
            "profileImage": "https://qph.ec.quoracdn.net/main-thumb-86854754-50-apcitmmrwrdgxmcxzspksmqprxzawqzj.jpeg",
            "rss_feed_link": "https://www.quora.com/profile/ronaldo/rss",
            "search_url": "https://www.quora.com/profile/ronaldo",
            "timestamp": 1499189682253,
            "user": "Phillip Michael Mpalabule"
        },
        {
            "post_ques": "What are some famous gestures of respect in sports?",
            "post_type": "question",
            "post_url": "https://www.quora.com//What-are-some-famous-gestures-of-respect-in-sports",
            "search_url": "https://www.quora.com/search/?q=ronaldo&type=question",
            "timestamp": 1499189682080
        },
    ...
}

The users and questions are returned at the same level in JSON, which is not good as they are completely different entities. Somthing like this would be better -

{
    "metadata": {
        ...
        "count": {
            "user": 1,
            "question": 11
        },
        ...
    },
    "users": [
        ...
    ],
    "questions": [
        ...
    ]
}

@hemantjadon
Copy link

@vibhcool I agree with @singhpratyush 's idea, it will be great if the users and questions are at different level

@vibhcool
Copy link
Member Author

vibhcool commented Jul 5, 2017

@hemantjadon @singhpratyush , instead of arranging different posts according to different types, we can mention the post_type field and scraper_name field.
Though for QuoraScraper, @singhpratyush 's idea is nice. But this will not work when there are numerous scrapers working concurrently. what do you say?

{
    "metadata": {
        ...
        "count": {
            "user": 1,
            "question": 11
        },
        ...
    },
    "posts": [
        {
             ...
             "post_type"="question",
             "scraper_name"="quora"
        },
        {
                "post_type"="profile",
             "scraper_name"="quora"
        },
{
                "post_type"="some other type",
             "scraper_name"="any other scraper"
        }

    ],
    
}

@singhpratyush
Copy link
Member

@vibhcool: This is not a good practice. If the entities are different, we should not put them in same JSON level.

If we later happen to have lots of services to give as an answer, we can put them under different levels as follows -

{
    "metadata": {
        ...
    },
    "results": {
        "quora": {
            "users": [
                ...
            ],
            "questions": [
                ...
            ]
        },
        "twitter": {
            "statuses": [
                ...
            ]
        },
        ...
    }
}

Copy link
Member

@singhpratyush singhpratyush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Copy link

@hemantjadon hemantjadon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhcool I agree here with @singhpratyush they should be in different levels

@vibhcool vibhcool force-pushed the 1230 branch 2 times, most recently from e9c0a15 to 85561f8 Compare July 7, 2017 10:26
@vibhcool
Copy link
Member Author

vibhcool commented Jul 7, 2017

@daminisatya @singhpratyush @kavithaenair @SKrPl @hemantjadon @Achint08 @djmgit
made the suggested changes , please review

Changes:

  1. Now post object can also be used as wrapper object instead of being a interface.

not changed:

  1. I have not implemented metadata showing posts data as it would make this PR bigger and a bit more complicated to review. I would fix it in subsequent PRs

Copy link

@hemantjadon hemantjadon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Contributor

@SKrPl SKrPl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The topics in knows_about differ in what the actual user knows, for example:
The query was "Android", the user provided is Marc Bodnick:

screenshot from 2017-07-08 08-17-20

Things he actually knows about differ from Android (operating system), related screen shot:
screenshot from 2017-07-08 08-16-43

Also, the search_url param has value https://www.quora.com/profile/Android which redirects to https://www.quora.com/topic/Android-operating-system, so is it possible to get the final URL (after redirection)?

@mariobehling mariobehling merged commit 4a213c5 into loklak:development Jul 8, 2017
@vibhcool
Copy link
Member Author

vibhcool commented Jul 8, 2017

@SKrPl yes, I am creating issue for this, these aren't related to PR task :)

@vibhcool vibhcool deleted the 1230 branch July 8, 2017 12:00
vibhcool added a commit to vibhcool/loklak_server that referenced this pull request Jul 15, 2017
vibhcool added a commit to vibhcool/loklak_server that referenced this pull request Jul 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants