# Clean apostrophes and quote marks
Fix mis-encoded apostrophes (as in haven't, there's, don't) and stray quote marks.

In [89]:
import pandas as pd
path = '/data/storyq/robot-ai-all-public.csv'
data = pd.read_csv(path, encoding='utf8')
len(data)

17073

In [9]:
# Search for punctuation issue
data = data.dropna(subset=['Paragraph'])
eg = data.loc[data['Paragraph'].str.contains('haven'), ['Paragraph']]
eg

Unnamed: 0,Paragraph
18,"I haven۪t heard the robot make a mistake,۝ sai..."
19,"I haven۪t heard the robot make a mistake,۝ sai..."
20,"I haven۪t heard the robot make a mistake,۝ sai..."
1887,Anthony Catania got suspicious when his floor ...
1888,Anthony Catania got suspicious when his floor ...
1889,Anthony Catania got suspicious when his floor ...
2178,Robotic technology will have to be developed a...
2179,Robotic technology will have to be developed a...
2180,Robotic technology will have to be developed a...
2379,"Norman Mack, the head of Robutler, is not conc..."


In [13]:
# Search for a second example of the punctuation issue (don't, that's, there's)
data = data.dropna(subset=['Paragraph'])
# eg = data.loc[data['Paragraph'].str.contains('a lot of emotion'), ['Paragraph']]
eg = data.loc[data['Paragraph'].str.contains('want to just walk around'), ['Paragraph']]
eg

Unnamed: 0,Paragraph
15,"You don۪t want to just walk around like a robot and act like you don۪t care,۝ Coghlan said. Oh, we hit a walk-off and came from behind. I guess that۪s a good job. Let۪s high-five each other.۪ There۪s a lot of emotion, a lot of passion in this game. You have to let those play out and be smart.۝"
16,"You don۪t want to just walk around like a robot and act like you don۪t care,۝ Coghlan said. Oh, we hit a walk-off and came from behind. I guess that۪s a good job. Let۪s high-five each other.۪ There۪s a lot of emotion, a lot of passion in this game. You have to let those play out and be smart.۝"
17,"You don۪t want to just walk around like a robot and act like you don۪t care,۝ Coghlan said. Oh, we hit a walk-off and came from behind. I guess that۪s a good job. Let۪s high-five each other.۪ There۪s a lot of emotion, a lot of passion in this game. You have to let those play out and be smart.۝"


In [10]:
pd.set_option('display.max_colwidth', None)

In [84]:
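# Manually fix issues: map each mis-encoded sequence to its replacement.
# Rules run in dict order, so an earlier short key (e.g. the bare mark) is
# consumed before later, longer keys containing it can match.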
transform = {
    'n۪': "n'",
    '۝': '',
    'e۪': "e'",
    't۪': "t'",
    '.۪': '.',
    '\xa0': ' ',
    '\xad': '',
    'ʉ': ' ',
    'I۪': "I'",
    'O۪': "O'",
    '̢': '',
    ' ̢': ' ',
    't̢': 't',
    'n̢': "n'",
    's۪': "s'",
    'C۪': "C'",
    'Ӊ': ' ',
    '\u2008': ' ',
    '\u2009': '',
    '•': '',
    '&#8217;': "'",
    '&#8216;': '',
    '&#x2014;': ':',
}

def clean(text):
    """Apply every replacement in `transform` to `text`, in dict order."""
    for char, newchar in transform.items():
        text = text.replace(char, newchar)
    return text
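
One thing to watch with a replacement table like this: Python dicts iterate in insertion order, so a short key listed early can consume characters that a longer key listed later was meant to match. The sketch below illustrates the effect with a made-up stand-in character (`#`) rather than the real mis-encoded marks. `apply_rules` and `longest_first` are hypothetical helper names, not part of the notebook.

```python
def apply_rules(text, rules):
    """Apply each (old, new) replacement in the order the dict yields them."""
    for old, new in rules.items():
        text = text.replace(old, new)
    return text


def longest_first(rules):
    """Return a copy of rules ordered so that longer keys are tried first."""
    return dict(sorted(rules.items(), key=lambda kv: len(kv[0]), reverse=True))


# '#' is a stand-in for a stray mark; these rules are illustrative only.
rules = {
    '#': '',       # bare mark: delete
    'n#': "n'",    # mark after 'n': treat as an apostrophe
}

sample = 'don#t'
as_written = apply_rules(sample, rules)                 # bare rule runs first
reordered = apply_rules(sample, longest_first(rules))   # specific rule runs first
print(as_written, reordered)
```

Whether the longer rules should win is a judgment about the data. The point is only that the order of keys in `transform` affects the result.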

In [85]:
# Clean the text (rows missing a Paragraph or Title are dropped first)
data = data.dropna(subset=['Paragraph', 'Title'])
cleaned = data.copy()
cleaned['Paragraph'] = data['Paragraph'].map(clean)
cleaned['Title'] = data['Title'].map(clean)

In [69]:
# Tests: show matching rows before and after cleaning
from IPython.display import display

phrases = ['a lot of emotion',
           'the robot make a mistake']
for p in phrases:
    display(data.loc[data['Paragraph'].str.contains(p, regex=False), ['Paragraph']])  # before
    display(cleaned.loc[cleaned['Paragraph'].str.contains(p, regex=False), ['Paragraph']])  # after

Unnamed: 0,Paragraph
15,"You don۪t want to just walk around like a robot and act like you don۪t care,۝ Coghlan said. Oh, we hit a walk-off and came from behind. I guess that۪s a good job. Let۪s high-five each other.۪ There۪s a lot of emotion, a lot of passion in this game. You have to let those play out and be smart.۝"
16,"You don۪t want to just walk around like a robot and act like you don۪t care,۝ Coghlan said. Oh, we hit a walk-off and came from behind. I guess that۪s a good job. Let۪s high-five each other.۪ There۪s a lot of emotion, a lot of passion in this game. You have to let those play out and be smart.۝"
17,"You don۪t want to just walk around like a robot and act like you don۪t care,۝ Coghlan said. Oh, we hit a walk-off and came from behind. I guess that۪s a good job. Let۪s high-five each other.۪ There۪s a lot of emotion, a lot of passion in this game. You have to let those play out and be smart.۝"


Unnamed: 0,Paragraph
15,"You don't want to just walk around like a robot and act like you don't care, Coghlan said. Oh, we hit a walk-off and came from behind. I guess that's a good job. Let's high-five each other. There's a lot of emotion, a lot of passion in this game. You have to let those play out and be smart."
16,"You don't want to just walk around like a robot and act like you don't care, Coghlan said. Oh, we hit a walk-off and came from behind. I guess that's a good job. Let's high-five each other. There's a lot of emotion, a lot of passion in this game. You have to let those play out and be smart."
17,"You don't want to just walk around like a robot and act like you don't care, Coghlan said. Oh, we hit a walk-off and came from behind. I guess that's a good job. Let's high-five each other. There's a lot of emotion, a lot of passion in this game. You have to let those play out and be smart."


Unnamed: 0,Paragraph
18,"I haven۪t heard the robot make a mistake,۝ said Andrew Albert, chairman of the New York City Transit Riders Council. I have heard the human make a mistake.۝"
19,"I haven۪t heard the robot make a mistake,۝ said Andrew Albert, chairman of the New York City Transit Riders Council. I have heard the human make a mistake.۝"
20,"I haven۪t heard the robot make a mistake,۝ said Andrew Albert, chairman of the New York City Transit Riders Council. I have heard the human make a mistake.۝"


Unnamed: 0,Paragraph
18,"I haven't heard the robot make a mistake, said Andrew Albert, chairman of the New York City Transit Riders Council. I have heard the human make a mistake."
19,"I haven't heard the robot make a mistake, said Andrew Albert, chairman of the New York City Transit Riders Council. I have heard the human make a mistake."
20,"I haven't heard the robot make a mistake, said Andrew Albert, chairman of the New York City Transit Riders Council. I have heard the human make a mistake."
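
The before/after displays above are checked by eye. A minimal sketch of the same check in automated form is below: it reports any text that still contains one of the keys from a replacement table. `find_leftovers` is a hypothetical helper, and the toy table and texts are made up for illustration. On the real data one could pass `cleaned['Paragraph']` and `transform` instead.

```python
def find_leftovers(texts, keys):
    """Return (position, key) pairs for every text that still contains a key."""
    return [
        (i, key)
        for i, text in enumerate(texts)
        for key in keys
        if key in text
    ]


# Toy stand-ins for the notebook's transform and cleaned paragraphs.
toy_transform = {'\xa0': ' ', '&#8217;': "'"}
toy_cleaned = ["I haven't heard", 'still&#8217;broken', 'fine']

leftovers = find_leftovers(toy_cleaned, toy_transform)
print(leftovers)
```

An empty list would mean every key in the table has been removed from every text.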


In [71]:
# Check for weird characters remaining in the titles
chars = {c for s in cleaned.Title.tolist() for c in s}
chars

{' ',
 '!',
 '"',
 '#',
 '$',
 '&',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '<',
 '>',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '|',
 'á',
 '̩',
 '۪',
 '—',
 '‘',
 '’'}
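
The set above shows which characters occur, but not how often or what they are. A minimal sketch of a fuller inventory, using only the standard library, is below. `non_ascii_inventory` is a hypothetical helper and the sample titles are made up for illustration.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(strings):
    """Count every non-ASCII character across an iterable of strings."""
    counts = Counter(c for s in strings for c in s if ord(c) > 127)
    # Pair each character's code point with its Unicode name and count.
    return [
        (f'U+{ord(c):04X}', unicodedata.name(c, 'UNKNOWN'), n)
        for c, n in counts.most_common()
    ]


# Made-up sample; on the real data one could pass cleaned['Title'].
sample_titles = ['Caf\u00e9 Robots', 'A \u2014 B \u2014 C']
for row in non_ascii_inventory(sample_titles):
    print(row)
```

Sorting by count puts the most common leftovers first, which may help decide which rules are worth adding to `transform`.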

In [79]:
pd.set_option('display.max_rows', None)
cleaned[['Title']].drop_duplicates()

Unnamed: 0,Title
0,LONG ISLAND JOURNAL
3,Camera System Creates Sophisticated 3-D Effects
6,Robotic Gadgets for Household Chores
9,HOPES SET ON A FEW FILM HITS
12,Designers Take Robots Out of Human Hands
15,"A Moment of Celebration, and Months of Recovery for Coghlan"
18,Transit Agency Weighs a Digital Upgrade for Aging Subway Cars
21,What Happens When Robots Write the Future?
24,"Results Unproven, Robotic Surgery Wins Converts"
27,A Robot For the Masses


In [86]:
cleaned.loc[cleaned.Title.str.contains('#'), ['Title']]

Unnamed: 0,Title


In [87]:
len(cleaned)

17058

In [90]:
# Save out
cleaned.to_csv('/data/storyq/ai_news_annotated.csv')

In [92]:
cleaned.head()

Unnamed: 0,Article ID,Article Date,Paragraph number,NYT section,Paragraph,Title,WorkTimeInSeconds,AI Mood,AI Relevance,Fiction,...,Other (negative),Cyborg (positive),Decisions (positive),Education (positive),Entertain (positive),Healthcare (positive),Singularity (positive),Transportation (positive),Work (positive),Other (positive)
0,4fd1cbc98eb7c8105d701286,1996-10-06 00:00:00 UTC,18,New York and Region,"Thus, next weekend will feature the robot who is named Sico (pronounced SEE-co). ''He can speak seven languages,'' Ms. Finkel said, ''and he interacts. Whatever language you speak he can speak.'' Of course he has never spoken to a Long Islander. No one is quite sure how he will interpret the Long Island accent.",LONG ISLAND JOURNAL,1472,4,5,0,...,{},0,0,0,0,0,0,0,0,{}
1,4fd1cbc98eb7c8105d701286,1996-10-06 00:00:00 UTC,18,New York and Region,"Thus, next weekend will feature the robot who is named Sico (pronounced SEE-co). ''He can speak seven languages,'' Ms. Finkel said, ''and he interacts. Whatever language you speak he can speak.'' Of course he has never spoken to a Long Islander. No one is quite sure how he will interpret the Long Island accent.",LONG ISLAND JOURNAL,49,4,5,0,...,{},0,0,0,0,0,0,0,0,{}
2,4fd1cbc98eb7c8105d701286,1996-10-06 00:00:00 UTC,18,New York and Region,"Thus, next weekend will feature the robot who is named Sico (pronounced SEE-co). ''He can speak seven languages,'' Ms. Finkel said, ''and he interacts. Whatever language you speak he can speak.'' Of course he has never spoken to a Long Islander. No one is quite sure how he will interpret the Long Island accent.",LONG ISLAND JOURNAL,66,5,5,0,...,{},0,0,0,1,0,0,0,0,{}
3,54b0793b7988100e21965770,2006-07-31 00:00:00 UTC,16,Technology,"That phrase was coined in the 1970۪s by Masahiro Mori, the Japanese robotics specialist, as he sought to describe the emotional response of humans to robots and other nonhuman entities. He theorized that as a robot became more lifelike, the emotional response of humans became increasingly positive and empathetic until a certain point at which the robot took on a zombie-like quality, and the human response turned to repulsion. Then, as the robot becomes indistinguishable from a human, the response turns positive again. Critics were quick to point out the eerie look of the characters in Polar Express.",Camera System Creates Sophisticated 3-D Effects,3053,3,4,0,...,{},0,0,0,0,0,0,0,0,{}
4,54b0793b7988100e21965770,2006-07-31 00:00:00 UTC,16,Technology,"That phrase was coined in the 1970۪s by Masahiro Mori, the Japanese robotics specialist, as he sought to describe the emotional response of humans to robots and other nonhuman entities. He theorized that as a robot became more lifelike, the emotional response of humans became increasingly positive and empathetic until a certain point at which the robot took on a zombie-like quality, and the human response turned to repulsion. Then, as the robot becomes indistinguishable from a human, the response turns positive again. Critics were quick to point out the eerie look of the characters in Polar Express.",Camera System Creates Sophisticated 3-D Effects,25,3,4,0,...,{},0,0,0,0,0,0,0,0,{}
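
The rows above still show the stray mark after a digit (`1970` followed by the mark and `s`), a case the per-letter rules in `transform` do not cover. One possible generalization, sketched below, is to replace the mark with an apostrophe whenever it sits between two word characters. Treating every such occurrence as an apostrophe is an assumption, and `fix_inner_mark` is a hypothetical helper, not part of the notebook.

```python
import re

# The stray mark is copied from the output above; treating it as an
# apostrophe whenever it sits between word characters is an assumption.
MARK = '۪'
BETWEEN_WORD_CHARS = re.compile(r'(?<=\w)' + re.escape(MARK) + r'(?=\w)')


def fix_inner_mark(text):
    """Replace MARK with an apostrophe when it is flanked by word characters."""
    return BETWEEN_WORD_CHARS.sub("'", text)


print(fix_inner_mark('coined in the 1970' + MARK + 's by Masahiro Mori'))
```

Occurrences after punctuation (for example at the end of a quoted sentence) are left untouched by this pattern and would still need their own rule.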


In [65]:
char = '•'
eg = cleaned.loc[cleaned.Paragraph.str.contains(char), ['Paragraph']]
eg

Unnamed: 0,Paragraph
8016,"• Uh-oh. A panel on how artificial intelligence could automatically produce news stories, at Columbia Journalism School. 6:30 p.m. [Free, with livestream]"
8017,"• Uh-oh. A panel on how artificial intelligence could automatically produce news stories, at Columbia Journalism School. 6:30 p.m. [Free, with livestream]"
8018,"• Uh-oh. A panel on how artificial intelligence could automatically produce news stories, at Columbia Journalism School. 6:30 p.m. [Free, with livestream]"
8163,"• There will be an emphasis on social interaction, including a “My Home” element that will allow players to virtually meet with fellow racers, sharing photos and user-created courses. There’s also a mode called Remote Racer, for controlling an avatar racer with artificial intelligence, and GT Life, an upgraded method of choosing and following career racing paths."
8164,"• There will be an emphasis on social interaction, including a “My Home” element that will allow players to virtually meet with fellow racers, sharing photos and user-created courses. There’s also a mode called Remote Racer, for controlling an avatar racer with artificial intelligence, and GT Life, an upgraded method of choosing and following career racing paths."
8165,"• There will be an emphasis on social interaction, including a “My Home” element that will allow players to virtually meet with fellow racers, sharing photos and user-created courses. There’s also a mode called Remote Racer, for controlling an avatar racer with artificial intelligence, and GT Life, an upgraded method of choosing and following career racing paths."
9060,"• Uh-oh. A panel on how artificial intelligence could automatically produce news stories, at Columbia Journalism School. 6:30 p.m. [Free, with livestream]"
9061,"• Uh-oh. A panel on how artificial intelligence could automatically produce news stories, at Columbia Journalism School. 6:30 p.m. [Free, with livestream]"
9062,"• Uh-oh. A panel on how artificial intelligence could automatically produce news stories, at Columbia Journalism School. 6:30 p.m. [Free, with livestream]"
9207,"• There will be an emphasis on social interaction, including a “My Home” element that will allow players to virtually meet with fellow racers, sharing photos and user-created courses. There’s also a mode called Remote Racer, for controlling an avatar racer with artificial intelligence, and GT Life, an upgraded method of choosing and following career racing paths."


In [64]:
# Mark where the character occurs (regex=False: match it literally)
eg.Paragraph.str.replace(char, "HERE IT IS", regex=False)

13500    But there was also the fact that the vibe was just a little strange, what with the underlying interest in polyamory and cryonics, along with the widespread concern that the apocalypse, in the form of a civilization-destroying artificial intelligence, was imminent. When I asked why a group of rationalists would disproportionately share such views, people tended to cite the mind-expanding powers of rational thought. ‘‘This community is much more open to actually evaluating weird ideas,’’ Andrew told me. ‘‘They’re willing to put in the effort to explore the question, rather than saying: ‘Oh, this is outside my window. Bye.’HERE IT IS’’ But the real reason, many acknowledged, was CFAR’s connection to Yudkowsky. Compulsive and rather grandiose, Yudkowsky is known for proclaiming the imminence of the A.I. apocalypse (‘‘I wouldn’t be surprised if tomorrow was the Final Dawn, the last sunrise before the earth and sun are reshaped into computing elements’’) and his own role as savior (

In [39]:
eg.iloc[4]['Paragraph'].replace(char, 'HERE IT IS')

'Taking issue with Joseph Schumpeter, who saw late capitalism’s “disintegration of the bourgeois family” as part of a broader “decomposition” of collective life, Klinenberg nevertheless exemplifies Schumpeter’s famous definition of a “civilized man” (in distinction to a “barbarian”) as one who recognizes the “relative validity” of his “convictions” while being able to “stand for them unflinchingly.” On those terms, Klinenberg is a civilized man indeed. His only false note for this reader — though hardly an exception to his civility — is his willingness to give HERE IT ISserious consideration to the use of robots as companions for the aged. But even here his intelligence allows him no more than a formal bow (“By the time the current generation of young adults reaches old age, their comfort with machines will make robotic companions even more attractive”) as opposed to the groveling prostration often observed in the presence of the All-High Tech. With unassailable judgment he says that “
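
Replacing an invisible character with a visible marker, as above, works but prints the whole paragraph. A minimal sketch of an alternative that returns only a short window around each occurrence is below. `show_context` is a hypothetical helper, and the sample string is made up for illustration.

```python
import re


def show_context(text, needle, width=30):
    """Return a snippet of text around each literal occurrence of needle."""
    snippets = []
    for m in re.finditer(re.escape(needle), text):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        # Bracket the needle so that invisible characters are easy to spot.
        snippets.append(
            text[start:m.start()] + '[' + needle + ']' + text[m.end():end]
        )
    return snippets


# '\u2022' is the bullet character; the sentence is a made-up sample.
sample = 'Free entry. \u2022 Uh-oh. A panel'
print(show_context(sample, '\u2022', width=6))
```

On the real data, this could be applied to a single paragraph such as `eg.iloc[4]['Paragraph']`.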