Import twitter archive #45

Merged
merged 7 commits into from Jul 2, 2013

@tralafiti
Contributor

Script to import the JSON files from Twitter's new archive feature. Only tested with my own data so far; feedback appreciated.

@badboy
badboy commented Jan 24, 2013

I made the MySQL table change manually, then copied over loadarchive.php and ran it. Worked just fine. 👍

@alanmoo
alanmoo commented Feb 13, 2013

Worked great for me, too!

@gr4y
gr4y commented Feb 14, 2013

Worked fine for me, too!

@xsteadfastx

works fine here...

@tralafiti
Contributor

Looks like a bunch of people are finally getting their Twitter archive. Thanks for the feedback.

@seefood
seefood commented Feb 17, 2013

42K tweets are taking quite a while to import, but it works very nicely, including full Unicode (which the csv export from Twitter lacks), so kudos!

@raamdev
raamdev commented Mar 25, 2013

Thanks for this @tralafiti!

To anyone importing a Twitter Tweet Archive into an existing Tweet Nest install for the first time: you'll need to either clear your database of existing tweets (by running TRUNCATE on tn_tweets, tn_tweetwords, and tn_words) or manually remove the .js files for the tweets already imported. Otherwise, running loadarchive.php will result in duplicate tweets.

This only happens the first time, as loadarchive.php keeps track of which .js files have already been imported. If you've never imported a Twitter Tweet Archive, you may have existing tweets that loadarchive.php doesn't know about.

Since Tweet Nest imports new Tweets automatically (which loadarchive.php won't know about), you'll need to be careful anytime you import a Twitter Tweet Archive into an existing Tweet Nest install, or you risk having duplicate tweets.

@tralafiti
Contributor

You're welcome @raamdev.

Did you run upgrade.php? It marks the tweetid column as unique to prevent duplicate tweets on existing instances. If you did, this is indeed a bug that should be fixed.
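For illustration, here is a minimal sketch of how a unique constraint on tweetid keeps a repeated import from creating duplicates. It uses Python's built-in sqlite3 module as a stand-in for Tweet Nest's MySQL database. The cut-down table layout, the index name idx_tweetid, and the import_tweets helper are assumptions made for the example, not code taken from upgrade.php or loadarchive.php.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for tn_tweets; the real table has more columns.
conn.execute(
    "CREATE TABLE tn_tweets ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, tweetid TEXT, text TEXT)"
)
# Hypothetical index name; this mirrors the unique constraint described above.
conn.execute("CREATE UNIQUE INDEX idx_tweetid ON tn_tweets (tweetid)")


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM tn_tweets").fetchone()[0]


def import_tweets(conn, tweets):
    """Insert tweets, silently skipping any tweetid that already exists.

    Returns the number of rows actually added.
    """
    before = count_rows(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO tn_tweets (tweetid, text) VALUES (?, ?)",
        tweets,
    )
    conn.commit()
    return count_rows(conn) - before


batch = [
    ("121647570861830144", "first tweet"),
    ("132989796304949248", "second tweet"),
]
first_run = import_tweets(conn, batch)   # both rows are new
second_run = import_tweets(conn, batch)  # same batch again, nothing is added
total_rows = count_rows(conn)
```

In MySQL the counterpart of `INSERT OR IGNORE` is `INSERT IGNORE`. Whether loadarchive.php relies on that or checks for existing rows some other way is not shown in this thread.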

@raamdev
raamdev commented Mar 25, 2013

@tralafiti I did run upgrade.php but I got an error that said something like "Duplicate entry ‘44794062607360000’ for key ‘tweetid’". I proceeded to run loadarchive.php (from the command line) which seemed to work, but then I noticed I had duplicate entries.

@tralafiti
Contributor

@raamdev This means there were already some duplicate tweets in your database, which prevented upgrade.php from making the unique alteration. Maybe the script should clean up these entries when it encounters this edge case, or at least stop the process with a meaningful error message. Thanks for the hint.

@ali0une
ali0une commented May 5, 2013

You should have a look at https://github.com/amwhalen/archive-my-tweets, which has a similar feature.

@gothick
gothick commented May 10, 2013

Thanks for the patch; great work, and just what I was looking for.

I found that while the import (into an existing Tweetnest install) worked beautifully, the normal Tweetnest loadtweets.php subsequently didn't grab any new tweets into my database. It looks like loadtweets.php finds the latest tweet using ORDER BY id DESC, so if you import a batch of older tweets, it gets confused because something that isn't your latest tweet ends up with the highest id.

I worked around it by finding my latest tweet, re-inserting it as the latest row in the database, and then deleting the original entry for it. I'd guess a better way would be for loadtweets.php to find the latest tweet using the tweet's time or Twitter's tweetid (which I think is always an incrementing "integer", even though it's actually a string)?
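A minimal sketch of the difference between the two lookups, using Python's built-in sqlite3 module rather than Tweet Nest's MySQL database. The table is reduced to two columns and the insertion order is invented for the example. It assumes, as suggested above, that tweetid grows over time, and it does not reproduce what loadtweets.php or the commit in this branch actually does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for tn_tweets: an auto-increment row id plus the
# tweetid stored as a string.
conn.execute(
    "CREATE TABLE tn_tweets ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, tweetid TEXT)"
)

# The newest tweet was fetched first; two older tweets were then imported
# from the archive and therefore received higher row ids.
for tweetid in ("132989796304949248",   # newest tweet, row id 1
                "44794062607360000",    # archive import, row id 2
                "121647570861830144"):  # archive import, row id 3
    conn.execute("INSERT INTO tn_tweets (tweetid) VALUES (?)", (tweetid,))
conn.commit()


def latest(conn, order_by):
    """Return the tweetid that sorts first under the given ORDER BY clause."""
    query = "SELECT tweetid FROM tn_tweets ORDER BY " + order_by + " LIMIT 1"
    return conn.execute(query).fetchone()[0]


# Row id order returns the last imported row, not the newest tweet.
by_row_id = latest(conn, "id DESC")
# Comparing tweetid as text is also wrong: the 17-digit id sorts above
# the 18-digit ids because '4' > '1'.
by_string = latest(conn, "tweetid DESC")
# Comparing tweetid numerically returns the newest tweet.
by_tweetid = latest(conn, "CAST(tweetid AS INTEGER) DESC")
```

The numeric cast matters because the ids are stored as strings of differing lengths. In MySQL the corresponding cast would be `CAST(tweetid AS UNSIGNED)`.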

@tralafiti
Contributor

@gothick Are you sure you applied this commit, which is also part of this branch? It should take care of the problem you ran into: tralafiti@e9ed808

@gothick
gothick commented May 12, 2013

@tralafiti You know what? Turns out I'm an idiot. I'd applied that commit, but managed not to upload it to my server along with the other changes. Sorry to trouble you!

@graulund graulund merged commit 17407e7 into graulund:master Jul 2, 2013
@richardmtl

Hi! I started by importing straight from Twitter, but it only grabbed my last 3200 tweets, so I tried to import the missing months from the downloaded tweet archive. When I clicked through to different months, Tweetnest only showed me the same tweets, my latest ones, no matter which month I clicked (although the counts differ per month and appear correct). I TRUNCATEd the appropriate tables and reimported EVERY month's "archive".js file from the very beginning of when I opened my Twitter account. All the counts are correct again, but clicking through to any month still shows me only the same tweets, my most recent. Any idea what to do? Here's my tweetnest: http://tweets.richardarchambault.ca

Thanks!

@liberborn

Experienced this issue today. I hadn't upgraded to the latest Tweet Nest version for about a year.

I had to clean up the duplicates in the DB manually (phpMyAdmin). Maybe this will be helpful for someone:

  • Find duplicate ids by running query:

SELECT tweetid
FROM tn_tweets
GROUP BY tweetid
HAVING count(tweetid) > 1;

  • Find duplicate rows by query:

SELECT * FROM tn_tweets WHERE tweetid IN (
'121647570861830144',
'132989796304949248',
...);

  • Remove duplicate items.
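The last step can also be done with a single statement instead of deleting rows by hand. The following is a sketch only, using Python's built-in sqlite3 module as a stand-in for MySQL. The three-column table, the sample rows, and the remove_duplicates helper are assumptions for the example. It keeps the row with the lowest id for each tweetid and ignores any related rows in other tables such as tn_tweetwords, which may need matching cleanup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for tn_tweets; the real table has more columns.
conn.execute(
    "CREATE TABLE tn_tweets ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, tweetid TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO tn_tweets (tweetid, text) VALUES (?, ?)",
    [
        ("121647570861830144", "a"),
        ("132989796304949248", "b"),
        ("121647570861830144", "a"),  # duplicate of row 1
        ("132989796304949248", "b"),  # duplicate of row 2
        ("44794062607360000", "c"),
    ],
)
conn.commit()


def remove_duplicates(conn):
    """Delete every row except the lowest id per tweetid.

    Returns the number of rows deleted.
    """
    cur = conn.execute(
        "DELETE FROM tn_tweets WHERE id NOT IN ("
        "SELECT MIN(id) FROM tn_tweets GROUP BY tweetid)"
    )
    conn.commit()
    return cur.rowcount


removed = remove_duplicates(conn)
remaining = conn.execute("SELECT COUNT(*) FROM tn_tweets").fetchone()[0]
# Same check as the first query above: no tweetid should appear twice now.
dupes_left = conn.execute(
    "SELECT tweetid FROM tn_tweets GROUP BY tweetid HAVING COUNT(tweetid) > 1"
).fetchall()
```

MySQL rejects a DELETE whose subquery reads directly from the table being deleted from, so there the subquery would need to be wrapped in a derived table or rewritten as a self-join. Back up the database before trying anything like this on a real install.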