In [1]:
import pandas as pd 
import re  # the standard-library module covers every pattern used below
import csv
import datetime

### Clean the dataframe of abstracts and dates. Every time the CSV is read, convert date to datetime and abstract to string.

In [2]:
df=pd.read_csv('clinton_csv')
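
Because both conversions have to be repeated on every load, they can be folded into the read itself. The following is a minimal sketch, assuming the file has `abstract` and `date` columns; `load_abstracts` is a hypothetical helper name and the two sample rows are made up for illustration.

```python
import io
import pandas as pd

def load_abstracts(path_or_buffer):
    """Read the CSV, returning abstracts as strings and dates as datetimes."""
    df = pd.read_csv(path_or_buffer, dtype={"abstract": str})
    # Assumption: a missing abstract becomes an empty string rather than the text 'nan'.
    df["abstract"] = df["abstract"].fillna("")
    df["date"] = pd.to_datetime(df["date"], utc=True)
    return df

# Made-up sample standing in for the real file.
sample = io.StringIO(
    "abstract,date\n"
    "Pres and Mrs Clinton hold millennium party,2000-01-02T05:00:00+0000\n"
    ",2000-01-03T05:00:00+0000\n"
)
demo = load_abstracts(sample)
print(demo.dtypes)
```

Note that `astype(str)` would turn a missing abstract into the literal text `'nan'`; filling with an empty string is a design choice of this sketch, not something the notebook does.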

In [4]:
df.sort_values('date', inplace=True)

In [6]:
df.reset_index(inplace=True)

In [7]:
df

Unnamed: 0,index,abstract,date
0,14775,Pres and Mrs Clinton hold millennium party at ...,2000-01-02T05:00:00+0000
1,14961,"For much of his presidency, Bill Clinton's own...",2000-01-03T05:00:00+0000
2,14802,As the first of two moving trucks turned onto ...,2000-01-05T05:00:00+0000
3,14976,It is a reality of modern campaigns that conte...,2000-01-05T05:00:00+0000
4,14982,Clinton Pushes Peace Talks President Clinton...,2000-01-05T05:00:00+0000
...,...,...,...
16436,13495,A new Gender Policy Council will look differen...,2021-02-16T18:27:12+0000
16437,13493,Nearly three decades after the White House est...,2021-02-16T18:27:23+0000
16438,14095,With a following of 15 million and a divisive ...,2021-02-17T17:35:38+0000
16439,13494,Rush Limbaugh made the G.O.P. the party of mis...,2021-02-20T11:55:04+0000


#### Convert all abstracts to strings

In [8]:
df['abstract']=df['abstract'].astype(str)

In [9]:
df.iloc[2]['abstract']

'As the first of two moving trucks turned onto Old House Lane, the new homeowners were nowhere to be seen. And rather than bring over a plate of cookies, one neighbor phoned the police to complain about the reporters huddled at the end of his driveway.    This was moving day, first family-style, on a street that has grown a bit weary of the publicity surrounding the anticipated arrival of President Clinton and Hillary Rodham Clinton.  '

#### Remove (M) and (S) from abstract

In [10]:
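# Match only the literal markers; a greedy r"\(.*\)" would also delete any text between two parentheses.
df['abstract']=df['abstract'].apply(lambda x: re.sub(r"\([MS]\)", "", x))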
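
When stripping markers such as (M) and (S), the choice of pattern matters. A greedy `\(.*\)` spans from the first opening parenthesis to the last closing one on a line, whereas `\([MS]\)` matches only the two markers. A small illustration on a made-up sentence:

```python
import re

# Made-up sentence containing both markers with real text between them.
text = "Pres Clinton (M) visits Senate (S) before vote"

greedy = re.sub(r"\(.*\)", "", text)      # removes everything from "(M" through "S)"
targeted = re.sub(r"\([MS]\)", "", text)  # removes only the markers themselves

print(repr(greedy))
print(repr(targeted))
```

The targeted pattern leaves double spaces behind, which the later whitespace step is meant to collapse.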

In [11]:
df['abstract'][7]

'Is it something in the water?    On the same day that Hillary Rodham Clinton was moving from Washington to Chappaqua, N.Y., in preparation for a run for the United States Senate, William F. Weld, the former governor of Massachusetts, was moving from Cambridge to the Upper East Side of Manhattan, presenting himself as a possible Republican candidate for governor of New York.  '

#### Remove backslashes from text (they appear only in the string representation, not in the actual text)

In [12]:
df['abstract']=df['abstract'].apply(lambda x: x.replace('\\',''))

In [13]:
df['abstract'][8311]  # a backslash in the displayed output only escapes an apostrophe; print() does not show it, see below

'As he prepares to take office, President-elect Barack Obama is relying on a small team of advisers who will lead his transition operation and help choose the members of a new Obama administration. Following is part of a series of profiles of potential members of the administration.'

In [14]:
print(df['abstract'][8311])

As he prepares to take office, President-elect Barack Obama is relying on a small team of advisers who will lead his transition operation and help choose the members of a new Obama administration. Following is part of a series of profiles of potential members of the administration.
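
The difference between the displayed value and the printed value comes from `repr`: the notebook shows the string's representation, which may escape a quote with a backslash even though the string contains no backslash character. A small check on a made-up string:

```python
# Made-up string holding both quote types, so repr() has to escape one of them.
s = 'He said "it\'s done"'

print(repr(s))  # notebook-style display, including the escaping backslash
print(s)        # the actual characters
print("\\" in s)  # whether the string really contains a backslash
```

If the last line reports no backslash, then `x.replace('\\', '')` leaves that string unchanged.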


#### Try to identify gibberish abstracts (extra long, etc.)


In [15]:
df['abstract'].str.len().sort_values(ascending=False).head(30)

3922    11690
2880     8616
3200     7350
3104     7305
2908     6475
3204     6271
2901     6247
2849     6234
2784     6072
2947     6038
2758     5980
2967     5925
3035     5889
2738     5827
2746     5811
3225     5720
2889     5704
3084     5668
2990     5647
3186     5642
2935     5618
2915     5552
3533     5529
3152     5448
3175     5414
2678     5387
2945     5287
2995     5057
3347     4955
3532     4922
Name: abstract, dtype: int64
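
The lengths above have a long tail, so some cutoff is needed to flag the unusual entries. As an alternative to choosing a fixed number by eye, the sketch below uses the common interquartile-range rule. This is a swapped-in technique for illustration, not the notebook's method, and the data is synthetic.

```python
import pandas as pd

# Synthetic data: nine short texts and one very long one.
abstracts = pd.Series(["word " * n for n in range(1, 10)] + ["x" * 5000])
lengths = abstracts.str.len()

q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)
cutoff = q3 + 1.5 * (q3 - q1)  # standard IQR outlier rule

flagged = abstracts[lengths > cutoff]
print(cutoff, flagged.index.tolist())
```

A rule like this adapts to the data, but a flagged abstract is only a candidate: long transcripts and news summaries still need to be inspected by reading them.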

In [17]:
df['abstract'][3922]

"Following is the transcript of Senator Hillary Rodham Clinton's remarks at the Council on Foreign Relations, as provided by CQ Transcriptions, Inc.CLINTON: You all know the litany of threats and challenges: the metastasizing threat of terrorists networks recruiting troops, setting up training camps, amassing weapons; a regime in North Korea openly testing missiles and nuclear weapons; an activist, expansionist Iran pursuing its own nuclear arsenal; a resurgent Taliban in Afghanistan;and an emerging civil war in Iraq; Russia and China pursuing their own interests, often at odds with such global imperatives as nuclear nonproliferation; and ending genocide in Darfur. Oil has never been more important in funding unstable, anti- American governments and yet we have failed to make the investments necessary to move more rapidly to alternative fuels, a policy that is now as important to our national security and our Mideast strategy as to our economy and environment. The lost opportunities of

In [18]:
df['abstract'][2880]

" INTERNATIONAL   A3-11    Re-Education CampsPose Problem for China  A vast penal system in China that is separate from the judicial system and is a relic of the Mao era is presenting a dilemma for a modern-day Communist Party that remains obsessed with security and political control.   A1    Sunni Refuses Cabinet Post  One of four Sunni Arabs picked over the weekend to join Iraq's new Shiite-controlled cabinet abruptly rejected the job, saying that he learned of his selection from a television news report and adding that he felt it would further a quota system for Sunnis that would only make sectarian problems worse.   A11    Insurgents in Iraq are drawing on dozens of stockpiled, bomb-rigged cars and groups of foreign fighters smuggled into the country in recent weeks to carry out most of the suicide attacks that have killed about 300 people in the last 10 days, senior American officers say.   A11    Bush Meets With Putin  President Bush met with President Vladimir V. Putin of Russia

In [19]:
df['abstract'][3200]

" INTERNATIONAL   A3-13    Role of Secret Police Examined After Attacks  In Jordan and across the Middle East, those seeking democratic reform say the central role of each country's secret police force is one of the biggest impediments to change, but last week's terror attacks in Amman accentuate one reason why even some reformers justify the secret police's blanket presence -- the fear that violence can spill across the border.   A1    Jordanian security officials announced the arrest of an Iraqi woman closely linked to the terrorist leader Abu Musab al-Zarqawi as a fourth bomber in the Amman hotel attacks. They also broadcast a taped confession showing her wearing a translucent suicide explosive belt packed with ball bearings and describing how she had tried unsuccessfully to blow herself up.   A1    Iraq Projects Moving Forward  The American official in charge of reconstruction in Iraq said that American-financed projects were moving forward, but that officials had not publicized th

In [20]:
mask=(df['abstract'].str.len()>1200)

In [21]:
df[mask]['abstract'].to_list()

["Pres Clinton sends Congress budget for fiscal 2001 built on idea that burgeoning surplus will allow nation to have it all, from higher federal spending to tax cuts and reductions in national debt; his eighth and final spending plan calls for additional spending on issues long central to his agenda, including health care, education, environment as well as increases for military; stresses fiscal responsibility and maps path to eliminating portion of national debt held by public over next 13 years; attempts to define surplus in manner that builds priorities into budget mix for next decade, leaving Republicans with relatively little room to pursue sweeping tax cuts; calls for spending nearly $1.84 trillion, including increase of 3.9 percent in spending on third of budget that covers basic government operations; puts heavy emphasis on shoring up and expanding Medicare, designating $299 billion out of 10-year surplus projection of $746 billion to extend system's solvency to 2025; would cre

#### Remove punctuation 

In [22]:
df['abstract']=df['abstract'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

In [23]:
df['abstract'][8311]

'As he prepares to take office Presidentelect Barack Obama is relying on a small team of advisers who will lead his transition operation and help choose the members of a new Obama administration Following is part of a series of profiles of potential members of the administration'

#### Remove extra spaces in between and at the end/beginning

In [24]:
# str.replace treats its argument literally, so use re.sub for the regular expression.
df['abstract']=df['abstract'].apply(lambda x: re.sub(r"\s{2,}", " ", x))

In [25]:
df['abstract']=df['abstract'].apply(lambda x: x.strip())
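
Python's built-in `str.replace` matches its first argument literally, while pandas' `Series.str.replace` accepts a regular expression when `regex=True` is passed. A minimal sketch of collapsing and trimming whitespace with the vectorized string methods, on made-up strings:

```python
import pandas as pd

s = pd.Series(["  Clinton   Pushes  Peace Talks ", "one\t\ttwo\n three"])

# Collapse any run of whitespace to a single space, then trim both ends.
clean = s.str.replace(r"\s+", " ", regex=True).str.strip()
print(clean.tolist())
```

Using `\s+` instead of `\s{2,}` also turns single tabs and newlines into plain spaces, which is an assumption about what the cleaned text should look like.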

In [26]:
df

Unnamed: 0,index,abstract,date
0,14775,Pres and Mrs Clinton hold millennium party at ...,2000-01-02T05:00:00+0000
1,14961,For much of his presidency Bill Clintons own a...,2000-01-03T05:00:00+0000
2,14802,As the first of two moving trucks turned onto ...,2000-01-05T05:00:00+0000
3,14976,It is a reality of modern campaigns that conte...,2000-01-05T05:00:00+0000
4,14982,Clinton Pushes Peace Talks President Clinton ...,2000-01-05T05:00:00+0000
...,...,...,...
16436,13495,A new Gender Policy Council will look differen...,2021-02-16T18:27:12+0000
16437,13493,Nearly three decades after the White House est...,2021-02-16T18:27:23+0000
16438,14095,With a following of 15 million and a divisive ...,2021-02-17T17:35:38+0000
16439,13494,Rush Limbaugh made the GOP the party of misogyny,2021-02-20T11:55:04+0000


#### Convert date to datetime (do this every time the CSV is loaded into pandas)

In [27]:
type(df['date'][0])

str

In [28]:
df['date']=pd.to_datetime(df['date'])
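
A short sketch of what the conversion changes, using two timestamps in the same format as the `date` column. Passing `utc=True` is an assumption of this sketch; it makes the result timezone-aware in UTC regardless of the offset in the text.

```python
import pandas as pd

dates = pd.Series(["2000-01-02T05:00:00+0000", "2021-02-20T11:55:04+0000"])
parsed = pd.to_datetime(dates, utc=True)

# Before: plain strings. After: pandas Timestamp objects.
print(type(dates[0]).__name__, "->", type(parsed[0]).__name__)

# Datetime columns expose the .dt accessor for date arithmetic and components.
print(parsed.dt.year.tolist())
```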

In [29]:
df

Unnamed: 0,index,abstract,date
0,14775,Pres and Mrs Clinton hold millennium party at ...,2000-01-02 05:00:00+00:00
1,14961,For much of his presidency Bill Clintons own a...,2000-01-03 05:00:00+00:00
2,14802,As the first of two moving trucks turned onto ...,2000-01-05 05:00:00+00:00
3,14976,It is a reality of modern campaigns that conte...,2000-01-05 05:00:00+00:00
4,14982,Clinton Pushes Peace Talks President Clinton ...,2000-01-05 05:00:00+00:00
...,...,...,...
16436,13495,A new Gender Policy Council will look differen...,2021-02-16 18:27:12+00:00
16437,13493,Nearly three decades after the White House est...,2021-02-16 18:27:23+00:00
16438,14095,With a following of 15 million and a divisive ...,2021-02-17 17:35:38+00:00
16439,13494,Rush Limbaugh made the GOP the party of misogyny,2021-02-20 11:55:04+00:00


#### Remove Numbers

In [30]:
df['abstract']=df['abstract'].apply(lambda x: re.sub(r'\d+', '', x))

#### Change to lowercase 

In [31]:
df['abstract']=df['abstract'].apply(lambda x: x.lower())
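
The text-cleaning steps can also be collected into one function so they always run in the same order. This is a sketch only: `clean_abstract` is a hypothetical name, and the whitespace collapse is moved to the end (unlike the cell order in this notebook) so that gaps left by removing punctuation and numbers are closed as well.

```python
import re

def clean_abstract(text):
    """Apply the cleaning steps to one abstract and return the result."""
    text = re.sub(r"\([MS]\)", "", text)   # drop the (M) and (S) markers
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation
    text = re.sub(r"\d+", "", text)        # drop numbers
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace last

print(clean_abstract("With a following of 15 million (M), and a divisive style."))
```

It could then be applied in one pass with `df['abstract'].apply(clean_abstract)`.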

#### Drop index

In [32]:
df.drop('index',axis=1, inplace=True)

In [33]:
df

Unnamed: 0,abstract,date
0,pres and mrs clinton hold millennium party at ...,2000-01-02 05:00:00+00:00
1,for much of his presidency bill clintons own a...,2000-01-03 05:00:00+00:00
2,as the first of two moving trucks turned onto ...,2000-01-05 05:00:00+00:00
3,it is a reality of modern campaigns that conte...,2000-01-05 05:00:00+00:00
4,clinton pushes peace talks president clinton ...,2000-01-05 05:00:00+00:00
...,...,...
16436,a new gender policy council will look differen...,2021-02-16 18:27:12+00:00
16437,nearly three decades after the white house est...,2021-02-16 18:27:23+00:00
16438,with a following of million and a divisive st...,2021-02-17 17:35:38+00:00
16439,rush limbaugh made the gop the party of misogyny,2021-02-20 11:55:04+00:00


#### Save to csv 

In [34]:
df.to_csv('clean_abstract_clinton', index=False, header=True)
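
Passing `index=False` keeps the row index out of the file. Without it, reloading the CSV produces an extra `Unnamed: 0` column. A small in-memory check on a made-up frame:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"abstract": ["a", "b"]})

# to_csv() with no path returns the CSV text, which is read straight back in.
with_index = pd.read_csv(io.StringIO(df_demo.to_csv()))
without_index = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```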