# Twitter data cleaning and reshaping

Demo for EMSE 6992, "Machine Learning for Analytics" on 1/30/19. 

This notebook shows examples of using command-line tools for working with Twitter data. 

For this exercise, we're using a dataset with 2,000 tweets collected 01/01/2019 from the Twitter sample stream API. Filename: sample.json

How many tweets are in this file? Look at the first tweet.

In [1]:
!wc -l sample.json

    2000 sample.json


In [2]:
!head -1 sample.json

{"retweeted": false, "is_quote_status": false, "retweet_count": 0, "text": "Happy birthday to a real one\ud83d\udda4 https://t.co/1o4FxnHgXg", "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "favorited": false, "extended_entities": {"media": [{"id": 1080174377138823175, "sizes": {"thumb": {"h": 150, "resize": "crop", "w": 150}, "large": {"h": 2047, "resize": "fit", "w": 1845}, "small": {"h": 680, "resize": "fit", "w": 613}, "medium": {"h": 1200, "resize": "fit", "w": 1082}}, "display_url": "pic.twitter.com/1o4FxnHgXg", "indices": [30, 53], "media_url_https": "https://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg", "id_str": "1080174377138823175", "media_url": "http://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg", "type": "photo", "url": "https://t.co/1o4FxnHgXg", "expanded_url": "https://twitter.com/emxdube/status/1080174384776732673/photo/1"}]}, "reply_count": 0, "in_reply_to_status_id_str": null, "in_reply_to_screen_name": null, "id_str": "108
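
The same two checks can be sketched in Python for readers who prefer to stay in the notebook kernel. This is a minimal sketch that assumes the file holds one JSON object per line, as the `wc -l` and `head -1` output suggests. The two inline records are made-up stand-ins for sample.json so that the cell runs on its own.

```python
import io
import json

# Hypothetical stand-in for sample.json: one JSON object per line.
# To use the real file, replace this with open("sample.json", encoding="utf-8").
fake_file = io.StringIO(
    '{"id_str": "1", "text": "first tweet", "lang": "en"}\n'
    '{"id_str": "2", "text": "second tweet", "lang": "es"}\n'
)

# Parse each non-empty line as its own JSON document.
tweets = [json.loads(line) for line in fake_file if line.strip()]

print(len(tweets))                # number of tweets, like `wc -l`
print(sorted(tweets[0].keys()))   # top-level fields of the first tweet
```

Parsing every line is slower than `wc -l`, but it also confirms that each line is valid JSON.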

View the first tweet in the file using jq.

In [3]:
!head -1 sample.json | jq '.'

{
  "retweeted": false,
  "is_quote_status": false,
  "retweet_count": 0,
  "text": "Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg",
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "favorited": false,
  "extended_entities": {
    "media": [
      {
        "id": 1080174377138823200,
        "sizes": {
          "thumb": {
            "h": 150,
            "resize": "crop",

Choose multiple fields to extract from all tweets, using jq and the -c flag for compact output. 
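
As a cross-check on the jq filter, here is a rough Python equivalent of selecting those four fields, including the nested `user.screen_name`. The record and the helper name `pick_fields` are hypothetical and only illustrate the shape of the data.

```python
import json

# Hypothetical record with the same fields the jq filter selects.
line = json.dumps({
    "id_str": "1",
    "text": "hello",
    "created_at": "Tue Jan 01 18:50:16 +0000 2019",
    "user": {"screen_name": "example_user"},
})

def pick_fields(tweet):
    """Return [id_str, text, user.screen_name, created_at]; missing values become None."""
    user = tweet.get("user") or {}
    return [
        tweet.get("id_str"),
        tweet.get("text"),
        user.get("screen_name"),
        tweet.get("created_at"),
    ]

row = pick_fields(json.loads(line))
# separators=(",", ":") gives compact output comparable to jq -c.
print(json.dumps(row, ensure_ascii=False, separators=(",", ":")))
```

Using `.get()` returns `None` for missing fields, much as jq yields `null` for a missing key.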

In [4]:
!cat sample.json | jq -c '[.id_str, .text, .user.screen_name, .created_at]'

["1080174384776732673","Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg","emxdube","Tue Jan 01 18:50:16 +0000 2019"]
["1080174388975226881","RT @aixagroetzner: Todos queremos que nos quieran en voz alta","areanax","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388975157250","RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1","fokito","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388987809792","RT @QueridoJeito: “Ou você arregaça as mangas e luta pelo que tem, ou você decide que está cansado e você desiste.” \n— This Is Us","ig_sg15","Tue Jan 01 18:50:17 +0000 201

["1080174393177837568","RT @ya_deviantka: @BananaForYazzy Все уже начали трезветь","BananaForYazzy","Tue Jan 01 18:50:18 +0000 2019"]
["1080174393169334272","RT @thejellyest: คือมันลงตัวไปหมด เคมีของดบด. กับกดว. คือดีทุกวัย ตอนเด็กคือsummer love ใสๆ วัยกลางคนคือศัตรูที่รัก วัยชราคือต่างคนต่างรักร…","daisystarpie","Tue Jan 01 18:50:18 +0000 2019"]
["1080174393181974528","RT @TavoSantos10: 🎞️ https://t.co/LXhGz2EHE0\nhttps://t.co/DPaCd45bZ4\n\n🌐 https://t.co/3emGs0fOs4 https://t.co/u4GW2vBn7R","TavoSantos10","Tue Jan 01 18:50:18 +0000 2019"]
["1080174393169494017","@ActusNonStop La dinde","cataclem","Tue Jan 01 18:50:18 +0000 2019"]

["1080174426732154880","RT @boglesbian: 2019 should be 20niceteen because first of all we need to be nicer to each other most of the time and second of all we are…","mickey_jae","Tue Jan 01 18:50:26 +0000 2019"]
["1080174426753323008","Millie Bobby Brown talking about the gays https://t.co/kNkm5u3Uge","DelectableThot","Tue Jan 01 18:50:26 +0000 2019"]
["1080174426753138688","RT @lovrgirlx: KEEP YOUR BAD ENERGY AWAY FROM ME 2019","_pdezzy","Tue Jan 01 18:50:26 +0000 2019"]
["1080174426740613121","告白って映画が想像してたのと全然違ってびっくりしたる","_kohyama","Tue Jan 01 18:50:26 +0000 2019"]
["1080174426753302533","https:

["1080174443522068481","T-Pain - Digiwaxx Media - Boo'd Up (T-Mix) (Dirty) is streaming on da Front Porch Radio 📡 https://t.co/c8Li1lC78N… https://t.co/i6RxGuaziH","daFrontPorch","Tue Jan 01 18:50:30 +0000 2019"]
["1080174443517931522","Tabi canım tabi ticaret usul borçlar öğrenci hepsi. Kütüphanede geçiyor olay zaten. Ticaret usule usul borçlara bor… https://t.co/2riRilguzI","ErtugrulKahram2","Tue Jan 01 18:50:30 +0000 2019"]
["1080174447716376577","Y si x shunsha compre un juego de dormitorio muy grande y ahora tengo que venderlo casi nuevo.. Acoliten","friega_again2","Tue Jan 01 18:50:31 +0000 2019"]
["1080174447703801856","RT @KK_Khalid_H: لعنت ہو ایسے نظام و قانون پر، دریا برد ہو ا

["1080174477080694784","RT @ahhrq8: \"لا يوجد اشخاص كئيبون...يوجد اشخاص في اماكن لا ينتمون لها روحا وعقلا\"","a29_2_","Tue Jan 01 18:50:38 +0000 2019"]
["1080174477072310272","#WWELive #HolidayTour memories #Chicago!!! #WWE #RAW #WWERaw @ United Center https://t.co/0Xj4epFkZ1","DashaFuentesWWE","Tue Jan 01 18:50:38 +0000 2019"]
["1080174477076586496","RT @DearMeNo: @KAYDM49 @GaryLineker 9000 to biggest petition ever \nhttps://t.co/y7UHAaPG35 https://t.co/2ltTLBPVhC","Vici1609","Tue Jan 01 18:50:38 +0000 2019"]
["1080174477055385600","RT @penny_lex: The best ของผลิตโชค คือ นู้ชชชชชชช ฮืออออ ขอบคุณที่มองเห็น ขอบคุณที่ทำให้รุว่าเราทำเพื่ออะรัย เรารักคนไม่ผิดจิงๆ #เป๊กผลิตโช…",

["1080174506444902400","RT @wnxxjk: 🐰cut🖤\n#정국 #JUNGKOOK @BTS_twt \nhttps://t.co/ZLMfU0p7Kz https://t.co/F2dtmUHL3b","endec27","Tue Jan 01 18:50:45 +0000 2019"]
["1080174506445099008","@kttouch Anyways","Zander_904","Tue Jan 01 18:50:45 +0000 2019"]
["1080174506419851265","@annascup Pls finish it","lisasbrows","Tue Jan 01 18:50:45 +0000 2019"]
["1080174506449215498","RT @iMNI8pLAb4DF5qc: #اسقاط_القروض_لليوم_ال27\nظلمواالشعب الكريم\nفسموه مستهتر😡\nمستهتر\nبسبب قروض حدتهم الحاجة إليها\nإنها ليست قروض ترف\nمستهتر…","majed50124909","Tue Jan 01 18:50:45 +0000 2019"]
["1080174506440826881","@TheHausOfTi

["1080174531610902529","Cc j'viens de regarder l' S04E13 de JoJo's Bizarre A...!   #tvtime https://t.co/vSzfrLALjQ https://t.co/wzBBWjoYkf","sidoow__","Tue Jan 01 18:50:51 +0000 2019"]
["1080174531598344193","Evil babies are enjoying their Christmas present https://t.co/TaaMY7fEmc","puffballpink","Tue Jan 01 18:50:51 +0000 2019"]
["1080174535771602944","RT @UKFootball: sdkfasdibnrlndfo;nwerignwrig;newrfjwerunfdrkgbetgirrtbg!!!\n\nTOUCHDOWN, KENTUCKY!!!! Bowden to the HOUSE on a punt return!","chasehunt32","Tue Jan 01 18:50:52 +0000 2019"]
["1080174535796768769","RT @fizyoterapisti7: Adaletli branş dağılımı istiyoruz 4 basamaklı alımlar istiyoruz  #KamuyaFizyoterapist duyun sesimizi @dbd

["1080174560950009856","RT @Prettyboyfredo: 1000 RTs and I’ll drop 2 more codes !!! 🔥💕","MarkoDaDonn","Tue Jan 01 18:50:58 +0000 2019"]
["1080174560962502657","@CNN These Democrats don’t trust Mr. Mueller. They think he doesn’t knows how to do an investigation. With all the… https://t.co/QDqYhzIOJo","cari_garrett","Tue Jan 01 18:50:58 +0000 2019"]
["1080174560941613057","RT @kalkhazy: مسلي وقصي ما وقفوا مع الجمهور  وكل قراراتهم حسب ميولهم ،،، الجمهور ما راح يطبل لهم زيك ،، الخلا اقرب لهم https://t.co/qj34i2a…","15SR640","Tue Jan 01 18:50:58 +0000 2019"]
["1080174560954208259","@I’m not a scam have only gave two of my babies what they wanted... I want to be your sugar daddy.I don't want 

["1080174594496122881","RT @diefmendes: feliz 1964 !\n\nsabe pq 1964 ?\n\npq nesse ano aconteceu o maior golpe de todos.. dei um golpe com a pica no cu de quem ta lendo","_lucianogs","Tue Jan 01 18:51:06 +0000 2019"]
["1080174594504491008","beinf outed without choice is rhe worst tgibg that can happen to someone can y’all not","realseokjinnie","Tue Jan 01 18:51:06 +0000 2019"]
["1080174594508627971","RT @QueenAlienB: RELATIONSHIPS ARE HARD, THEY AREN'T JUST SO YOU HAVE SOMEONE CUTE TO POST ABOUT OR SO YOU HAVE A LABEL, THEY TAKE TIME, PA…","LorTayy3","Tue Jan 01 18:51:06 +0000 2019"]
["1080174594517073921","RT @pep_vilamala: #HappyNewYear \n#HappyNewYearEve\nClic https://t.co/59XUpxHZZg

["1080174623864451072","RT @Plania_JIN: 석진이 인생에 눈물은 하품할때 흐아야아아아암 하다가 그렁그렁 맺히는 정도였으면 좋겠다 https://t.co/Sg6WxMqPoG","moon_130613_","Tue Jan 01 18:51:13 +0000 2019"]
["1080174623889731586","RT @chuuzus: i did NOT spend my entire day making this for this to never see the light of day. Here is a recap of everything that happened…","kendwizzle","Tue Jan 01 18:51:13 +0000 2019"]
["1080174623877144576","RT @NYRangerstown: Damn it's January already? What's next, February? Fuck everything.","BuchToKreids","Tue Jan 01 18:51:13 +0000 2019"]
["1080174623881420800","RT @lahnt1: بداية ٢٠١٩ كانت بإنقاذ شخص سقط من علو ٢٠ متر واصيب بعدة كسور متفرقة .. اسأل الله له الشفاء العاجل .. 🙏🏼\n\n#دبي https://t.co/

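For filtering and output, it is helpful to transform the date into a more useful format. Let's look at the original created_at date and transform it into an ISO 8601 date. Note that the converted values in the output below read 13:50 where the source reads 18:50, even though the +0000 offset means the source is already in UTC. The five-hour shift most likely comes from the jq build or the local time zone used for this run, so check the converted hour against the original before relying on it.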
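
For comparison, the conversion can be sketched with Python's standard `datetime` module. The format string below is an assumption based on the created_at values shown in this notebook (abbreviated weekday and month names, a numeric UTC offset), and `to_iso8601` is a hypothetical helper name. `%a` and `%b` follow the current locale, so this assumes English day and month names.

```python
from datetime import datetime, timezone

# Layout of created_at: abbreviated weekday and month, day, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def to_iso8601(created_at):
    """Parse a created_at string and return it as an ISO 8601 UTC timestamp."""
    parsed = datetime.strptime(created_at, TWITTER_FORMAT)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

iso = to_iso8601("Tue Jan 01 18:50:16 +0000 2019")
print(iso)  # 2019-01-01T18:50:16Z
```

Because `%z` produces a timezone-aware value, the hour stays at 18 after the conversion to UTC.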

In [5]:
!cat sample.json | jq -c '[.created_at, (.created_at | strptime("%a %b %d %T %z %Y") | todate)]'

["Tue Jan 01 18:50:16 +0000 2019","2019-01-01T13:50:16Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]

["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]
["Tue Jan 01 18:50:26 +0000 2019","2019-01-01T13:50:26Z"]

["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]
["Tue Jan 01 18:50:35 +0000 2019","2019-01-01T13:50:35Z"]

["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]
["Tue Jan 01 18:50:40 +0000 2019","2019-01-01T13:50:40Z"]

["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]
["Tue Jan 01 18:50:46 +0000 2019","2019-01-01T13:50:46Z"]

["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]
["Tue Jan 01 18:50:53 +0000 2019","2019-01-01T13:50:53Z"]

["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]
["Tue Jan 01 18:51:01 +0000 2019","2019-01-01T13:51:01Z"]

["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]
["Tue Jan 01 18:51:09 +0000 2019","2019-01-01T13:51:09Z"]

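Look at all of the tweets in a particular language, in this case Spanish ("es"). The lang field is machine-detected by Twitter, so a few tweets in other languages can slip through.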
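
A sketch of the same filter in Python, assuming each record carries Twitter's `lang` code. It compares the code exactly rather than by substring and tolerates records where the field is null or absent. The records are hypothetical.

```python
# Hypothetical records; only the "lang" and "text" fields matter here.
tweets = [
    {"lang": "es", "text": "hola"},
    {"lang": "en", "text": "hello"},
    {"lang": None, "text": "no language detected"},
    {"text": "lang field missing"},
]

# Exact comparison keeps only Spanish and skips missing or null values.
spanish = [t["text"] for t in tweets if t.get("lang") == "es"]
print(spanish)  # ['hola']
```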

In [6]:
!cat sample.json | jq -c 'select(.lang == "es") | [.text]'

["RT @aixagroetzner: Todos queremos que nos quieran en voz alta"]
["RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1"]
["RT @lusita_comunica: yo: me avisas si llegaste bien a tu casa\n\nmi amiga: *no lo hace*\n\nyo: https://t.co/GoOuK6wlT7"]
["@ActusNonStop La dinde"]
["RT @AldoDuqueSantos: Surge una esperanza en América ,de la mano de Jair Bolsonaro el coloso despierta y sacude la modorra de años de populi…"]
["RT @lodoc4reta: por un 2019 sin amigos familiares conocidos tóxicos y q lo único tóxico q tengamos cerca sea el toxic by natalia y alba"]
["RT @h4rryindie: estas no fueron las fiestas más raras en años??? no hubo espíritu navideño ni sentí la emoción por año nue

["@agussanches382 Eso no es así hermanita, una equivocacion jajaja"]
["RT @Maracuchozo: Y muchos cobres. https://t.co/iFKr6FHzyA"]
["RT @AlexDago00BG: @MIAREsproject buena respuesta, super madura en una situacion asi"]
["RT @alex_risicaro: Jorge Muñoz y el nuevo logotipo de la Municipalidad de Lima inaugurando gestión https://t.co/279bOEFbJs"]
["RT @dearlightsx_: ES QUE LA FANTASIA DE NARRATIVA DE MIKI, JOAN Y ALFRED CANTANDO JUNTOS ????\n\nES QUE ME MUERO WIGGGGGGGGGGGGGGGG https://t…"]
["@RGR_98 Me gusta los fursuits de zorros ^^"]
["xdd\nenero: @DementeYT\nfebrero: @franndela_\nmarzo: @moodvelasco\nabril: @pedritoyuta\nmayo: @Pandom_yt\njunio:… https://t.co/4fUFuTFcjc"]
["RT @UniNoticias: California se convierte en el pri

["@Dsclpcaio todo dia"]
["RT @leavemeal0ne_: qué pesadilla las 700 historias que está subiendo la gente de mejores momentos del año mira que me llevo tragando vuestr…"]
["RT @ruidocontenido: Yo intentando fingir inocencia tras haber cometido una maldad: https://t.co/SBG3A5Prg3"]
["RT @mymindflies: Por un 2019 menos homofóbico, machista, xenofobico, racista y violento."]
["@TomasBordone_ te quiero borrrŕrrracho❤"]
["RT @ClubAmerica: #FiestaAzulcrema\nCelebremos todos juntos:\n🏆 El campeonato #1⃣3⃣\n🏆 1er Campeonato Femenil\n🏆 El Campeonato Sub-17\n\n No puede…"]
["El avión que despegó el 1 de enero y aterrizó el 31 de diciembre https://t.co/RlacytHhjM https://t.co/uEnXNhuxbT"]
["RT @CamilaAyalaaa: El vaguito con Bermudas y ca

Create a CSV with a subset of the fields.

In [7]:
!cat sample.json | jq -r '[.id_str, .created_at, .text] | @csv' > tweets.csv

In [8]:
!head -20 tweets.csv

"1080174384776732673","Tue Jan 01 18:50:16 +0000 2019","Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg"
"1080174388975226881","Tue Jan 01 18:50:17 +0000 2019","RT @aixagroetzner: Todos queremos que nos quieran en voz alta"
"1080174388975157250","Tue Jan 01 18:50:17 +0000 2019","RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1"
"1080174388987809792","Tue Jan 01 18:50:17 +0000 2019","RT @QueridoJeito: “Ou você arregaça as mangas e luta pelo que tem, ou você decide que está cansado e você desiste.” 
— This Is Us"
"1080174388975153152","Tue Jan 01 18:50:17 +0000 2019","RT @kofi: maybe my wife will reveal herself to me this year. maybe i’ll become the man i’ve always wanted to be."
"1080174388987809793","Tue Jan 01 18:50:17 +0000 2019","@SaaD_M_H7 ماعندي جائزه للي يبتسم عادي تقدر تعدي المقطع"
"1080174388971036672","Tue Jan 01 18:50:17 +0000 2019","Here is what we came up with after 240 

Note that newlines in the tweet text split a single record across several lines, as in the fourth row above. Line breaks inside quoted fields are valid CSV, but they trip up line-oriented tools such as head and wc, so let's replace each \n newline character in the text field with a space. This version also drops the created_at column.
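
The same cleanup can be sketched with Python's `csv` module. The rows are hypothetical, and the choice to quote every field (`csv.QUOTE_ALL`) is an assumption made here to keep the output predictable. It is not something the notebook requires.

```python
import csv
import io

# Hypothetical rows; the second text contains an embedded newline.
rows = [
    ("1", "single line"),
    ("2", "first line\nsecond line"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
for id_str, text in rows:
    # Replace CR and LF with spaces so every record stays on one physical line.
    clean = text.replace("\r", " ").replace("\n", " ")
    writer.writerow([id_str, clean])

csv_text = buffer.getvalue()
print(csv_text, end="")
```

After the replacement, the number of lines in the output equals the number of records, which is what tools like `head` and `wc -l` assume.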

In [9]:
!cat sample.json | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv' > tweets.csv

In [10]:
!head -20 tweets.csv

"1080174384776732673","Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg"
"1080174388975226881","RT @aixagroetzner: Todos queremos que nos quieran en voz alta"
"1080174388975157250","RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1"
"1080174388987809792","RT @QueridoJeito: “Ou você arregaça as mangas e luta pelo que tem, ou você decide que está cansado e você desiste.”  — This Is Us"
"1080174388975153152","RT @kofi: maybe my wife will reveal herself to me this year. maybe i’ll become the man i’ve always wanted to be."
"1080174388987809793","@SaaD_M_H7 ماعندي جائزه للي يبتسم عادي تقدر تعدي المقطع"
"1080174388971036672","Here is what we came up with after 240 days of work - looking for suggestions/opinions. #indiegames https://t.co/MAwP5vO8MV"
"1080174388987731968","RT @Sofihassan8: بڑا شور سنتے تھے پہلو میں دل کا                                          جو چیرا تو اک قطرہ خوں نہ نکلا…"

Hashtags are stored in nested JSON, so let's flatten them into a semicolon-delimited list, keeping only tweets that have at least one hashtag.
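
A minimal Python sketch of the flattening step, assuming hashtags live under `entities.hashtags` as a list of objects with a `text` key (the path the jq filter reads). The sample records and the helper name `flatten_hashtags` are hypothetical.

```python
# Hypothetical tweets shaped like entities.hashtags: a list of {"text": ...} objects.
tweets = [
    {"entities": {"hashtags": [{"text": "Water"}, {"text": "oceans"}]}},
    {"entities": {"hashtags": []}},
    {},  # no entities at all
]

def flatten_hashtags(tweet):
    """Join the hashtag texts of one tweet with semicolons."""
    hashtags = (tweet.get("entities") or {}).get("hashtags") or []
    return ";".join(tag["text"] for tag in hashtags)

# Keep only tweets with at least one hashtag.
flattened = [flatten_hashtags(t) for t in tweets if flatten_hashtags(t)]
print(flattened)  # ['Water;oceans']
```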

In [11]:
!cat sample.json | jq -c 'select(.entities.hashtags | length >= 1) | [([.entities.hashtags[].text] | join(";"))]'

["indiegames"]
["potus;TDS;haters"]
["PossePresidencial"]
["trust;trusttheprocess;trustissues;life;livelifetofullest;motivating;motivationalquotes;motivationalquote"]
["HappNewYear2019"]
["GetUp"]
["MerryChristmas"]
["ARCplay"]
["NH106;BackToTheFuture"]
["IfChristmasWereAPerson"]
["izmireſcort;bucaeſcort"]
["انتصارات_السعوديه_العظمي_2018"]
["BamFlinstone;OnTheBeat"]
["UniqueNewYearCelebrations"]
["TwitterTuesday;Lima;Peru"]
["AudiogameJam3;ch

["DFRT"]
["Water;oceans"]
["الاتحاد"]
["민현;황민현;워너원;MinHyun;WannaOne"]
["topnine;bestnine;HappyNewYear"]
["윤지성;yoonjisung"]
["현진;StrayKids;스트레이키즈;Hyunjin"]
["Meb40BinŞubatAtaması"]
["내갤러리_남자사진중_제일최근사진이_이상형이다"]
["NewProfilePic"]
["Thkaa"]
["اسقاط_القروض;إسقاط_القروض_مطلب_شعبي;اسقاط_القروض_لليوم_28;اسقاط_القروض_لليوم_ال28"]
["الإمارات"]
["PossePresidencial"]
["بياض_الاسنان;الخرج;الدلم;السيح;استشاري"]
["Soul;Funk;Jazz;LiveMusic"]