# Deriving a random subset of tweets (JSON or CSV)

Let's say you're working with a file containing tweets, and you'd like to derive a random sample of tweets from that file.  For example, your set might be in chronological order, and you want just a random sample of 100 tweets from a mix of dates.

Note that the technique described here works for any text file with one observation per line, whether the observations are tweets or something else, and whether the format is CSV, line-oriented JSON, or another line-based text format.  A CSV file whose quoted fields contain embedded line breaks does not meet that condition, because a single observation then spans more than one line.

Let's take a look at our sample data file:

In [26]:
!head -n 5 debatetweets.csv

created_at,twitter_id,screen_name,followers_count,friends_count,favorite_count/like_count,retweet_count,hashtags,mentions,in_reply_to_screen_name,twitter_url,text,is_retweet,is_quote,coordinates,url1,url1_expanded,url2,url2_expanded,media_url
2016-09-26 17:52:09+00:00,780464994479542272,tlajj5,21,18,0,0,"TrumpTrain, MAGA",realDonaldTrump,,http://twitter.com/tlajj5/status/780464994479542272,RT @realDonaldTrump: New national Bloomberg poll just released - thank you! Join the MOVEMENT: https://t.co/3KWOl2ibaW.  #TrumpTrain #MAGA…,Yes,No,,https://t.co/3KWOl2ibaW,http://www.DonaldJTrump.com,,,
2016-09-26 17:55:09+00:00,780465749794123777,Kosky98,87,82,0,0,Debates2016,HebertEtHalfred,,http://twitter.com/Kosky98/status/780465749794123777,RT @HebertEtHalfred: Quoi ? #Debates2016 ou quoi ?? https://t.co/NUYvbc0kYc,Yes,No,,,,,,http://pbs.twimg.com/media/CtTFQ2rWcAAKLb7.jpg
2016-09-26 17:56:34+00:00,780466104720236545,DeplorableTink,84,66,0,0,"sundaythoughts, ImWithHer, Debates2016, Crooked

It appears to have a header row, which we'll deal with shortly, followed by the data.  It looks like the tweets were written to the file in some sort of chronological order.  How long is it?

In [27]:
!wc -l debatetweets.csv

    1001 debatetweets.csv


The file has 1,001 lines: 1 header row followed by 1,000 observations.
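Because `wc -l` counts the header line too, it can be handy to count only the data lines.  Here is a minimal sketch of that arithmetic on a small stand-in file; `tiny.csv` and its contents are made up for illustration and are not part of the tweet data:

```shell
# Build a tiny stand-in file: 1 header line + 5 data lines (hypothetical data).
printf 'id,text\n' > tiny.csv
for i in 1 2 3 4 5; do printf '%s,tweet %s\n' "$i" "$i" >> tiny.csv; done

# Count every line, then only the lines after the header.
# tr strips the padding that some versions of wc add around the number.
total_lines=$(wc -l < tiny.csv | tr -d ' ')
data_lines=$(tail -n +2 tiny.csv | wc -l | tr -d ' ')

echo "total lines:  $total_lines"
echo "observations: $data_lines"
```

Knowing the number of observations also tells you whether the requested sample size is achievable, since `-n` outputs at most that many lines.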

The key tool for drawing the sample is the `shuf` shell command, part of GNU Coreutils.  On Linux it is available as `shuf`.  On macOS it is not installed by default; installing Coreutils with Homebrew (`brew install coreutils`) provides it as `gshuf`, which is the name used in this notebook.  The `!` prefix in IPython simply passes the command to the system shell, so the name to use is whichever one your system provides.  Let's see what it can do:

In [28]:
!gshuf --help

Usage: gshuf [OPTION]... [FILE]
  or:  gshuf -e [OPTION]... [ARG]...
  or:  gshuf -i LO-HI [OPTION]...
Write a random permutation of the input lines to standard output.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -e, --echo                treat each ARG as an input line
  -i, --input-range=LO-HI   treat each number LO through HI as an input line
  -n, --head-count=COUNT    output at most COUNT lines
  -o, --output=FILE         write result to FILE instead of standard output
      --random-source=FILE  get random bytes from FILE
  -r, --repeat              output lines can be repeated
  -z, --zero-terminated     line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/shuf>
or av

It looks like this is something we can use!  First we need to peel off the header row so that it doesn't get shuffled in with the data; we'll put it back later:

In [29]:
!head -n 1 debatetweets.csv > header.csv

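Now for the main event.  We'll use `tail -n +2` to output debatetweets.csv starting from its second line (that is, everything *except* the header), pipe ( `|` ) that to `gshuf`, and take advantage of the `-n` option to request only 100 lines.  Unless `-r` is given, no input line is output more than once, so this is a sample without replacement: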

In [30]:
!tail -n +2 debatetweets.csv | gshuf -n 100 > only100tweets.csv

Now let's take a quick look at the result:

In [31]:
!head -n 5 only100tweets.csv

2016-09-27 08:58:10+00:00,780692999651074048,beingmissdaisy,1278,1007,0,0,debatenight,Elijahkyama,,http://twitter.com/beingmissdaisy/status/780692999651074048,RT @Elijahkyama: The person who did this will be haunted for nothing https://t.co/69C2KHMAk3 #debatenight,Yes,No,,,,,,http://pbs.twimg.com/media/CtUjRLIVIAAK5VT.jpg
2016-09-27 06:59:53+00:00,780663232537198592,PlatoSays,2141,993,0,0,"debatenight, Debates2016",Agent4Trump,,http://twitter.com/PlatoSays/status/780663232537198592,RT @Agent4Trump: Stop Lying Hillary Fact Check Trolls.... ICE union endorses Trump https://t.co/EaVN7m9rxN #debatenight #Debates2016,Yes,No,,https://t.co/EaVN7m9rxN,http://politi.co/2cWt8Wp,,,
2016-09-27 09:43:24+00:00,780704382048428032,Alex70CDA,456,375,0,0,"Narcos, Debate",la_maquina,,http://twitter.com/Alex70CDA/status/780704382048428032,RT @la_maquina: That one time a rich ranting lunatic thought he would be President. #Narcos / #Debate https://t.co/Xp1wCcqYFA,Yes,No,,,,,,http://pbs.twimg.com/media/

That looks like the random sample we expected.  It doesn't appear to be in any chronological order.
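Each run draws a different sample.  If the sample needs to be reproducible, the `--random-source=FILE` option listed in the help output can be pointed at a saved file of random bytes.  The following is a sketch only: the file names (`tiny.csv`, `seed.bin`, `draw1.csv`, `draw2.csv`) and the stand-in data are hypothetical, and the first line picks whichever of `shuf` or `gshuf` is installed.

```shell
# Use shuf where available (Linux), otherwise gshuf (Homebrew coreutils on macOS).
SHUF=$(command -v shuf || command -v gshuf)

# Stand-in data: 1 header line + 20 data lines (hypothetical).
printf 'id,text\n' > tiny.csv
seq 1 20 | sed 's/.*/&,tweet &/' >> tiny.csv

# Save a block of random bytes once; keeping this file makes the draw repeatable.
head -c 4096 /dev/urandom > seed.bin

# Two draws with the same byte source select the same lines in the same order.
tail -n +2 tiny.csv | "$SHUF" -n 5 --random-source=seed.bin > draw1.csv
tail -n +2 tiny.csv | "$SHUF" -n 5 --random-source=seed.bin > draw2.csv

cmp draw1.csv draw2.csv && echo "draws are identical"
```

Storing the seed file next to the sample lets someone else regenerate the same subset from the same input, as long as the input file and the `shuf` version are unchanged.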

In [32]:
!wc -l only100tweets.csv

     100 only100tweets.csv


Last but not least, we need to reattach the header row.  (This step, like removing the header earlier, does not apply to a line-oriented JSON file, which has no header row.)

In [33]:
!cat header.csv only100tweets.csv > debatetweets-100sample.csv

In [34]:
!head -n 5 debatetweets-100sample.csv

created_at,twitter_id,screen_name,followers_count,friends_count,favorite_count/like_count,retweet_count,hashtags,mentions,in_reply_to_screen_name,twitter_url,text,is_retweet,is_quote,coordinates,url1,url1_expanded,url2,url2_expanded,media_url
2016-09-27 08:58:10+00:00,780692999651074048,beingmissdaisy,1278,1007,0,0,debatenight,Elijahkyama,,http://twitter.com/beingmissdaisy/status/780692999651074048,RT @Elijahkyama: The person who did this will be haunted for nothing https://t.co/69C2KHMAk3 #debatenight,Yes,No,,,,,,http://pbs.twimg.com/media/CtUjRLIVIAAK5VT.jpg
2016-09-27 06:59:53+00:00,780663232537198592,PlatoSays,2141,993,0,0,"debatenight, Debates2016",Agent4Trump,,http://twitter.com/PlatoSays/status/780663232537198592,RT @Agent4Trump: Stop Lying Hillary Fact Check Trolls.... ICE union endorses Trump https://t.co/EaVN7m9rxN #debatenight #Debates2016,Yes,No,,https://t.co/EaVN7m9rxN,http://politi.co/2cWt8Wp,,,
2016-09-27 09:43:24+00:00,780704382048428032,Alex70CDA,456,375,0,0,"Nar

Done!
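The separate steps above can also be collapsed into one command group that writes the header and the shuffled sample straight to a single output file, with no intermediate files.  This is a sketch under stated assumptions: `tiny.csv` and `tiny-10sample.csv` are hypothetical stand-ins for the real files, and the first line picks whichever of `shuf` or `gshuf` is installed.

```shell
# Use shuf where available (Linux), otherwise gshuf (Homebrew coreutils on macOS).
SHUF=$(command -v shuf || command -v gshuf)

# Stand-in data: 1 header line + 50 data lines (hypothetical).
printf 'id,text\n' > tiny.csv
seq 1 50 | sed 's/.*/&,tweet &/' >> tiny.csv

# Header first, then 10 randomly chosen data lines, all into one output file.
{
  head -n 1 tiny.csv
  tail -n +2 tiny.csv | "$SHUF" -n 10
} > tiny-10sample.csv

# Expect 11 lines: 1 header + 10 sampled observations.
wc -l < tiny-10sample.csv
```

For a line-oriented JSON file there is no header, so the `head` line drops out and the file can be given to `shuf` directly.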