
API for sharing data #2

Open
fhopp opened this issue Sep 26, 2019 · 4 comments
fhopp (Collaborator) commented Sep 26, 2019

We need to find a good API that lets us obtain sharing data of newspaper articles.

fhopp added the enhancement label on Sep 26, 2019
fhopp (Collaborator, Author) commented Sep 27, 2019

Check out Facebook's Graph API: https://developers.facebook.com/docs/graph-api
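
As a reference point, here is a minimal sketch of pulling engagement counts for a single article URL with `requests`; the `engagement` field and API version are assumptions to double-check against the Graph API docs, and the token is a placeholder.

```python
import requests

ACCESS_TOKEN = "YOUR_APP_TOKEN"  # placeholder
article_url = "https://example.com/some-article"

# Query the Graph API URL node for its engagement counts.
# The field name and version are assumptions -- verify against the Graph API docs.
resp = requests.get(
    "https://graph.facebook.com/v4.0/",
    params={"id": article_url, "fields": "engagement", "access_token": ACCESS_TOKEN},
)
resp.raise_for_status()
print(resp.json())  # expected shape: {"engagement": {"share_count": ..., ...}, "id": ...}
```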

fhopp added this to Waiting in Sharing on Sep 27, 2019
fhopp moved this from Waiting to In progress in Sharing on Sep 27, 2019
fhopp (Collaborator, Author) commented Sep 27, 2019

@yibeichan and @fhopp discussed today that we can get very informative data via the Twitter API. For now, we are going to focus on "shares" on Twitter and return to the idea of Facebook shares later. The unit for scraping Twitter data will still be a single URL. However, we will create two extra tables in Cassandra:

  1. twitter_shares
  2. twitter_tweets

For (1), each row will be a unique URL, and the columns will hold the number of unique users that mentioned this URL along with the total retweet, like, and comment counts for this URL.

For (2), each row will be a unique tweet that mentioned this URL, along with metadata for that tweet such as its text and how many likes, replies, and favorites it has received.
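
A rough sketch of what these two tables could look like, created through the Python `cassandra-driver`; the keyspace, column names, and types below are placeholders, not a settled schema.

```python
from cassandra.cluster import Cluster

# Keyspace name "sharing" is a placeholder.
session = Cluster(["127.0.0.1"]).connect("sharing")

# (1) One row per URL with aggregated counts -- column names are assumptions.
session.execute("""
    CREATE TABLE IF NOT EXISTS twitter_shares (
        url text PRIMARY KEY,
        unique_users int,
        total_retweets int,
        total_likes int,
        total_replies int
    )
""")

# (2) One row per (URL, tweet) with per-tweet metadata.
session.execute("""
    CREATE TABLE IF NOT EXISTS twitter_tweets (
        url text,
        tweet_id bigint,
        tweet_text text,
        likes int,
        replies int,
        favorites int,
        PRIMARY KEY (url, tweet_id)
    )
""")
```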

The next step for @yibeichan is to think about how we can retrieve so many URLs. @musainayatmalik will help with implementing the "Twitter scraping" pipeline in PySpark.
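
As a starting point for that pipeline, a minimal PySpark sketch that parallelizes per-URL lookups; `fetch_share_counts` and the input path are hypothetical placeholders for whatever Twitter client and URL source we settle on.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("twitter-shares").getOrCreate()

def fetch_share_counts(url):
    # Placeholder: the real version would call the Twitter API / scraper for this URL.
    return (url, 0, 0, 0, 0)  # (url, unique_users, retweets, likes, replies) -- dummy values

# One article URL per line; the path is a placeholder.
urls = spark.sparkContext.textFile("article_urls.txt")
rows = urls.map(fetch_share_counts)

# Inspect a small sample; the real pipeline would write these rows to Cassandra instead.
print(rows.take(5))
```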

yibeichan commented

Several ways to get historical Twitter data (sorted):

  1. http://www.orgneat.com/ (free): it doesn't allow downloading the tweets themselves, but that should be fine since we only need the retweet count. If we can get the tweet IDs, I can try other ways to retrieve the tweets.
  2. Use multiple public Twitter databases, choose certain topics or combine them, search for news links among the tweets, and get share counts.
    Database catalog: https://www.docnow.io/catalog/ (free)
    Most of these databases are topic-, event-, or keyword-specific.
  3. https://github.com/Jefferson-Henrique/GetOldTweets-python (free)
    I have used it before, but it doesn't include deleted tweets. We can give it a try.
  4. https://codecanyon.net/item/historical-tweets/22120633 ($14 purchase): it seems good; it's an app.
  5. https://sifter.texifter.com/: this site has the complete, undeleted historical Twitter data from 01/14/2014 to 09/29/2018, and the data can be cleaned with https://discovertext.com/ ($24/month). However, we need to contact Twitter for approval to use the data.
  6. https://www.trackmyhashtag.com/historical-twitter-data (paid): this one retrieves historical data by hashtag.
  7. https://www.tweetbinder.com/payments/#/process-payment/historical (one-time purchase?): historical data, limited to 140,000 tweets.

fhopp (Collaborator, Author) commented Oct 3, 2019

@yibeichan, can we close this now? We are using sharedcount.com to get the Facebook data, and we will pay to get the Twitter data, right? Can you open an issue for the Twitter data and comment with the link to the company so I can get started on the application? Thanks!
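
For the Facebook side, a minimal sketch of what a sharedcount.com request could look like; the endpoint path and parameter names are assumptions to verify against their docs, and the key is a placeholder.

```python
import requests

API_KEY = "YOUR_SHAREDCOUNT_KEY"  # placeholder
article_url = "https://example.com/some-article"

# Endpoint and parameter names are assumptions -- check the SharedCount docs.
resp = requests.get(
    "https://api.sharedcount.com/v1.0/",
    params={"url": article_url, "apikey": API_KEY},
)
resp.raise_for_status()
print(resp.json())  # expected to include Facebook share/comment/reaction counts
```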
