## De-Duping the CSV

To make sure that the data in the original Google sheet was correct, we kept more columns than we needed. In this notebook, we examine the columns in the "one to rule them all" file and decide which of them can be deleted. The original merged file will be unaffected, as we will be working with a duplicate.

In [43]:
import pandas as pd
import string

In [14]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV into a dataframe
# =-=-=-=-=-=-=-=-=-=-= 

# Let python create the column names list:
with open('./TEDtalks_all.csv') as f:
    colnames = f.readline().strip().split(",")

# To use the Talk ID (first column in CSV) as the pandas row index instead,
# pass index_col='Talk ID' to read_csv below.

df = pd.read_csv('./TEDtalks_all.csv', names=colnames, skiprows=1)

df.head(10)

Unnamed: 0,Talk ID,public_url,speaker_name,headline,description_x,event,duration_x,language,published,tags,speaker,duration_y,uploaded,views,description_y,text
0,1,https://www.ted.com/talks/al_gore_on_averting_...,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,en,6/27/06,"alternative energy,cars,global issues,climate ...",Al Gore,PT16M17S,2006-06-27T00:11:00+00:00,3266733,With the same humor and humanity he exuded in ...,"Thank you so much, Chris. And it's truly a g..."
1,7,https://www.ted.com/talks/david_pogue_says_sim...,David Pogue,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,en,6/27/06,"simplicity,entertainment,interface design,soft...",David Pogue,PT21M26S,2006-06-27T00:11:00+00:00,1702201,New York Times columnist David Pogue takes aim...,"(Music: ""The Sound of Silence,"" Simon & Garf..."
2,53,https://www.ted.com/talks/majora_carter_s_tale...,Majora Carter,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,en,6/27/06,"MacArthur grant,cities,green,activism,politics...",Majora Carter,PT18M36S,2006-06-27T00:11:00+00:00,2000421,"In an emotionally charged talk, MacArthur-winn...",If you're here today — and I'm very happy th...
3,66,https://www.ted.com/talks/ken_robinson_says_sc...,Ken Robinson,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,en,6/27/06,"children,teaching,creativity,parenting,culture...",Ken Robinson,PT19M24S,2006-06-27T00:11:00+00:00,51614087,Sir Ken Robinson makes an entertaining and pro...,Good morning. How are you? (Laughter) ...
4,92,https://www.ted.com/talks/hans_rosling_shows_t...,Hans Rosling,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,en,6/27/06,"demo,Asia,global issues,visualizations,global ...",Hans Rosling,PT19M50S,2006-06-27T20:38:00+00:00,12662135,You've never seen data presented like this. Wi...,"About 10 years ago, I took on the task to te..."
5,96,https://www.ted.com/talks/tony_robbins_asks_wh...,Tony Robbins,Why we do what we do,"Tony Robbins discusses the ""invisible forces"" ...",TED2006,0:21:45,en,6/27/06,"entertainment,goal-setting,potential,psycholog...",Tony Robbins,PT21M45S,2006-06-27T20:38:00+00:00,22368699,"Tony Robbins discusses the ""invisible forces"" ...",Thank you. I have to tell you I'm both chall...
6,49,https://www.ted.com/talks/joshua_prince_ramus_...,Joshua Prince-Ramus,Behind the design of Seattle's library,Architect Joshua Prince-Ramus takes the audien...,TED2006,0:19:58,en,7/10/06,"library,architecture,design,culture,collaboration",Joshua Prince-Ramus,PT19M58S,2006-07-10T00:11:00+00:00,1042335,Architect Joshua Prince-Ramus takes the audien...,I'm going to present three projects in rapid...
7,86,https://www.ted.com/talks/julia_sweeney_on_let...,Julia Sweeney,Letting go of God,When two young Mormon missionaries knock on Ju...,TED2006,0:16:32,en,7/10/06,"atheism,Christianity,religion,God,comedy,humor...",Julia Sweeney,PT16M32S,2006-07-10T00:11:00+00:00,3903747,When two young Mormon missionaries knock on Ju...,"On September 10, the morning of my seventh b..."
8,71,https://www.ted.com/talks/rick_warren_on_a_lif...,Rick Warren,A life of purpose,"Pastor Rick Warren, author of ""The Purpose-Dri...",TED2006,0:21:02,en,7/18/06,"Christianity,philanthropy,religion,God,happine...",Rick Warren,PT21M2S,2006-07-18T00:11:00+00:00,3361934,"Pastor Rick Warren, author of ""The Purpose-Dri...","I'm often asked, ""What surprised you about t..."
9,94,https://www.ted.com/talks/dan_dennett_s_respon...,Dan Dennett,Let's teach religion -- all religion -- in sch...,Philosopher Dan Dennett calls for religion -- ...,TED2006,0:24:45,en,7/18/06,"atheism,consciousness,evolution,philosophy,rel...",Dan Dennett,PT24M45S,2006-07-18T00:11:00+00:00,2751013,Philosopher Dan Dennett calls for religion — a...,It's wonderful to be back. I love this wonde...


In [10]:
df.shape

(2686, 16)

Are all talks in English?

In [6]:
len(df.loc[df['language'] != 'en'])

30

In [16]:
# To see those rows, since there are only thirty:
# (Pandas' native display in Jupyter is nicer than the printed version.)

df.loc[df['language'] != 'en']

Unnamed: 0,Talk ID,public_url,speaker_name,headline,description_x,event,duration_x,language,published,tags,speaker,duration_y,uploaded,views,description_y,text
967,1120,https://www.ted.com/talks/sarah_kaminsky,Sarah Kaminsky,My father the forger,Sarah Kaminsky tells the extraordinary story o...,TEDxParis 2010,0:14:00,fr,9/7/11,"entertainment,war,global issues,storytelling,h...",Sarah Kaminsky,PT14M,2011-09-07T01:01:00+00:00,609267,Sarah Kaminsky tells the extraordinary story o...,"I am the daughter of a forger, not just any ..."
984,1235,https://www.ted.com/talks/danielle_de_niese_a_...,Danielle de Niese,A flirtatious aria,Can opera be ever-so-slightly sexy? The glorio...,TEDGlobal 2011,0:05:55,de,9/30/11,"theater,entertainment,sex,creativity",Danielle de Niese,PT5M55S,2011-09-30T14:50:29+00:00,817067,Can opera be ever-so-slightly sexy? The glorio...,"(Music) ♫ I don't understand myself, ♫ ♫ ..."
997,1250,https://www.ted.com/talks/guy_philippe_goldste...,Guy-Philippe Goldstein,How cyberattacks threaten real-world peace,Nations can now attack other nations with cybe...,TEDxParis 2010,0:09:24,fr,10/19/11,"war,terrorism,security,global issues,politics,...",Guy-Philippe Goldstein,PT9M24S,2011-10-19T15:27:46+00:00,489600,Nations can now attack other nations with cybe...,Good afternoon. If you have followed diploma...
1010,1263,https://www.ted.com/talks/sandra_fisher_martin...,Sandra Fisher-Martins,The right to understand,"Medical, legal, and financial documents should...",TEDxO'Porto,0:15:42,pt,11/6/11,"simplicity,language,law,design,TEDx,culture",Sandra Fisher-Martins,PT15M42S,2011-11-06T14:59:43+00:00,322344,"Medical, legal, and financial documents should...","The story I want to tell you about, started ..."
1379,1653,https://www.ted.com/talks/young_ha_kim_be_an_a...,Young-ha Kim,"Be an artist, right now!",Why do we ever stop playing and creating? With...,TEDxSeoul,0:16:57,ko,2/15/13,"writing,art,creativity,TEDx,spoken word",Young-ha Kim,PT16M57S,2013-02-15T15:53:11+00:00,1919266,Why do we ever stop playing and creating? With...,"The theme of my talk today is, ""Be an artist..."
1447,1721,https://www.ted.com/talks/liu_bolin_the_invisi...,Liu Bolin,The invisible man,Can a person disappear in plain sight? That's ...,TED2013,0:07:46,zh-cn,5/15/13,"china,global issues,photography,art,identity",Liu Bolin,PT7M46S,2013-05-15T14:44:26+00:00,1332484,Can a person disappear in plain sight? That's ...,"Liu Bolin: By making myself invisible, I try..."
1554,1783,https://www.ted.com/talks/mohamed_hijri_a_simp...,Mohamed Hijri,A simple solution to the coming phosphorus crisis,There's a farming crisis no one is talking abo...,TEDxUdeM,0:13:41,fr,10/29/13,"botany,ecology,plants,Anthropocene,farming,agr...",Mohamed Hijri,PT13M41S,2013-10-29T15:02:28+00:00,635801,There's a farming crisis no one is talking abo...,I'm going to start by asking you a question:...
1594,1803,https://www.ted.com/talks/suzanne_talhouk_don_...,Suzanne Talhouk,Don't kill your language,"More and more, English is a global language; s...",TEDxBeirut,0:14:12,ar,1/6/14,"language,TEDx,poetry,culture",Suzanne Talhouk,PT14M12S,2014-01-06T16:31:34+00:00,1298939,"More and more, English is a global language; s...",Good morning! Are you awake? They took my na...
1622,1925,https://www.ted.com/talks/yann_dall_aglio_love...,Yann Dall'Aglio,Love -- you're doing it wrong,"In this delightful talk, philosopher Yann Dall...",TEDxParis 2012,0:10:42,fr,2/14/14,"philosophy,love,TEDx,relationships,culture",Yann Dall'Aglio,PT10M42S,2014-02-14T15:52:57+00:00,4067273,"In this delightful talk, philosopher Yann Dall...",What is love? It's a hard term to define in ...
1766,2088,https://www.ted.com/talks/antonio_donato_nobre...,Antonio Donato Nobre,The magic of the Amazon: A river that flows in...,"The Amazon River is like a heart, pumping wate...",TEDxAmazonia,0:21:35,pt-br,9/19/14,"rivers,trees,global commons,climate change,nat...",Antonio Donato Nobre,PT21M35S,2014-09-19T15:04:54+00:00,1044681,"The Amazon River is like a heart, pumping wate...",What do you guys think? For those who watche...


The first thing to note is that almost all of these talks have transcripts in English, so some translation has taken place. We could inspect the quality of the translation, or we can accept that it has been publicly accepted, which is also the case for the current transcriptions. The real question, then, is whether to include these talks in our analysis. Since they represent approximately 1% of the talks (30 of 2,686), I would argue that dropping them would cost the larger program little, and that we would be better served by seeking out these and other talks in their original languages to see if the results are similar to those for the English-language talks.

Second, the good news is that, so far as these particular talks go, the majority appear to be in French or Spanish, which are *not* that distant from English in terms of linguistic expectations, unlike the talks in Chinese or Hindi. Let's count the two and add them to be sure:

In [17]:
len(df.loc[df['language'] == 'fr']) + len(df.loc[df['language'] == 'es'])

20
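A fuller breakdown by language can be had in one pass with `value_counts`. The following is only a sketch on a toy frame: the `toy` dataframe and its language codes are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the language codes are made up for illustration.
toy = pd.DataFrame({'language': ['en', 'en', 'fr', 'es', 'fr', 'ko', 'en']})

# Tally every language code except English in a single pass.
non_english = toy.loc[toy['language'] != 'en', 'language'].value_counts()
print(non_english)

# Combined count for French and Spanish, equivalent to adding the two len() calls.
fr_es_total = int(toy['language'].isin(['fr', 'es']).sum())
print(fr_es_total)
```

On the real data, the same two lines applied to `df` should reproduce the totals computed above.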

Can we identify those rows and remove them from the dataframe? There is `df.drop`, but we can also simply filter for the English-language rows, create a new dataframe, and then save that as a separate CSV (later). Once we have the new dataframe, `df_en`, we can check its shape to see if the correct number of rows was removed.

In [22]:
df_en = df.loc[df['language'] == 'en']

df_en.shape

(2656, 16)
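For comparison, here is roughly what the `df.drop` route would look like next to the boolean filter. This is a sketch on a toy frame with hypothetical values, not the real data; the names `toy`, `filtered`, `dropped`, and `non_english_idx` are introduced only for this example.

```python
import pandas as pd

# Toy stand-in for the real dataframe (hypothetical values).
toy = pd.DataFrame({'Talk ID': [1, 7, 1120, 1235],
                    'language': ['en', 'en', 'fr', 'de']})

# Route 1: keep only the English rows with a boolean filter.
filtered = toy.loc[toy['language'] == 'en']

# Route 2: look up the index labels of the non-English rows and drop them.
non_english_idx = toy.index[toy['language'] != 'en']
dropped = toy.drop(non_english_idx)

# Both routes should leave the same rows behind.
print(filtered.equals(dropped))
```

The filter is shorter, which is probably why it is the more common idiom; `drop` is handy when the row labels to remove are already known.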

### Comparing Columns

We know that some of the columns are duplicates. Some are fairly easy to compare: there's a difference in formatting and we can decide which we prefer. For others, like speakers and descriptions, we want to make sure we are not throwing away any metadata that could be useful later. Both sets of columns are strings. In the case of the descriptions, it looks like those from the Google document have had fancier punctuation (e.g., smart quotes and em dashes) replaced. Such differences are trivial, but it would help to establish this by removing punctuation and comparing just the words.

What I want to write is something like this:

```python
for row in df_en.index:
    if df_en['description_x'][row] != df_en['description_y'][row]:
        print(df_en['Talk ID'][row])
```

But that isn't quite "*the pandas way*" [/Sean Connery in _The Untouchables_].

In [25]:
# Here's the difference between two descriptions:

row = 5 # This just happens to be the Tony Robbins pair of descriptions,
        # which have different typography for the em dash.
        # I used this setup so I would only have to change the number once.
        
print(df_en['description_x'][row] + "\n" + df_en['description_y'][row])

Tony Robbins discusses the "invisible forces" that motivate everyone's actions -- and high-fives Al Gore in the front row.
Tony Robbins discusses the "invisible forces" that motivate everyone's actions — and high-fives Al Gore in the front row.


In [48]:
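# Translation table that deletes ASCII punctuation and the em dash
punct = dict.fromkeys(map(ord, string.punctuation + '—'), None)

# Compare description_x against description_y once punctuation is stripped
print(df_en['description_x'][row].translate(punct) + "\n" 
      + df_en['description_y'][row].translate(punct))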

Tony Robbins discusses the invisible forces that motivate everyones actions  and highfives Al Gore in the front row
Tony Robbins discusses the invisible forces that motivate everyones actions  and highfives Al Gore in the front row


In [None]:
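# Row-by-row version: print the Talk ID wherever the two descriptions differ
for idx, row in df_en.iterrows():
    if row['description_x'] != row['description_y']:
        print(row['Talk ID'])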

In [51]:
print(df_en.loc[df_en['description_x'].translate(punct) != df_en['description_y'].translate(punct)])

AttributeError: 'Series' object has no attribute 'translate'
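The `AttributeError` arises because `translate` is a method of Python strings, not of a pandas Series. One way around it is the `.str` accessor, which applies a string method to each element. What follows is a minimal sketch on a toy frame: the rows in `toy` are hypothetical, and `fillna('')` is an added assumption to keep missing descriptions from raising or comparing oddly.

```python
import string
import pandas as pd

# Map every ASCII punctuation character, plus the em dash, to None (i.e. delete it).
punct = dict.fromkeys(map(ord, string.punctuation + '\u2014'), None)

# Toy stand-in for df_en (hypothetical rows): the first pair differs only in punctuation.
toy = pd.DataFrame({
    'Talk ID': [96, 1],
    'description_x': ['forces -- and high-fives', 'With humor'],
    'description_y': ['forces \u2014 and high-fives', 'With humour'],
})

# .str applies the string method element by element across the Series.
clean_x = toy['description_x'].fillna('').str.translate(punct)
clean_y = toy['description_y'].fillna('').str.translate(punct)

# Talk IDs whose descriptions still differ once punctuation is gone.
mismatches = toy.loc[clean_x != clean_y, 'Talk ID']
print(mismatches.tolist())
```

Applied to `df_en`, an empty result would suggest that the two description columns differ only in punctuation, in which case either one could be dropped safely.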

## Dropping the Columns

The syntax for dropping a column in Pandas is very simple:

    df.drop('column_name', axis=1)

The `axis=1` argument tells Pandas that the label refers to a column rather than a row.
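As a quick illustration of that syntax, here is a sketch on a toy frame with hypothetical column values. It also shows the `columns=` keyword, an equivalent spelling, and that `drop` returns a new dataframe instead of changing the original.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({'headline': ['A', 'B'],
                    'duration_x': ['0:16:17', '0:21:26'],
                    'duration_y': ['PT16M17S', 'PT21M26S']})

# drop returns a new dataframe; toy itself is left unchanged.
by_axis = toy.drop('duration_y', axis=1)

# The columns= keyword is an equivalent spelling that avoids the axis number.
by_keyword = toy.drop(columns=['duration_y'])

print(list(by_axis.columns))
print(list(toy.columns))
```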

In [52]:
print(colnames)

['Talk ID', 'public_url', 'speaker_name', 'headline', 'description_x', 'event', 'duration_x', 'language', 'published', 'tags', 'speaker', 'duration_y', 'uploaded', 'views', 'description_y', 'text']


In [56]:
list(df_en)

['Talk ID',
 'public_url',
 'speaker_name',
 'headline',
 'description_x',
 'event',
 'duration_x',
 'language',
 'published',
 'tags',
 'speaker',
 'duration_y',
 'uploaded',
 'views',
 'description_y',
 'text']

In [60]:
drop_columns = ['speaker', 'duration_y', 'uploaded', 'description_y', 'language']
df_drop = df_en.drop(drop_columns, axis=1)

In [61]:
df_drop.shape

(2656, 11)

In [62]:
df_drop.head()

Unnamed: 0,Talk ID,public_url,speaker_name,headline,description_x,event,duration_x,published,tags,views,text
0,1,https://www.ted.com/talks/al_gore_on_averting_...,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g..."
1,7,https://www.ted.com/talks/david_pogue_says_sim...,David Pogue,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf..."
2,53,https://www.ted.com/talks/majora_carter_s_tale...,Majora Carter,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...
3,66,https://www.ted.com/talks/ken_robinson_says_sc...,Ken Robinson,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...
4,92,https://www.ted.com/talks/hans_rosling_shows_t...,Hans Rosling,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te..."


In [63]:
% ls

Google_list.csv                       descriptions.csv
TEDtalks_2018_edited.csv              descriptions_URL_list.txt
TEDtalks_2018_raw.csv                 discussions/
TEDtalks_all.csv                      discussions_URL_list.txt
Td-00-added_files.ipynb               gender.py
Td-01-Parsing_the_Transcripts.ipynb   transcript_log.txt
Td-02-Parsing_the_Descriptions.ipynb* transcripts/
Td-03-merging.ipynb                   transcripts copy.csv
Td-04-deduping_the_csv.ipynb          transcripts.csv
descriptions/                         transcripts_URL_list.txt
descriptions-pbmatic/


In [64]:
df_drop.to_csv('tedtalks2018.csv')