# Process Metadata

Jenna Jordan

### Purpose

One level of analysis compares which New York Times (NYT) sections the articles returned by each query belong to. This notebook parses the "other_metadata" field to extract the section information, then matches each article and its metadata to the queries that returned it.

In [1]:
import pandas as pd

In [2]:
metadata_df = pd.read_csv("../Data/Metadata/bln_metadata.csv", dtype = {'aid':str, 'other_metadata':str}, encoding='utf-8')

In [3]:
article_df = pd.read_csv("../Data/Analyze/BLNqueries_compare_article-level_26Feb.csv", dtype = {'aid':str})

In [4]:
metadata_df

Unnamed: 0,aid,other_metadata
0,20181123210429337,"{""section"":""Section C; Page 3, Column 1; Scien..."
1,20181123210429804,"{""section"":""Section 4; Page 18, Column 1; Edit..."
2,20181123210429807,"{""section"":""Section 4; Page 18, Column 4; Edit..."
3,20181123210442721,"{""section"":""Section A; Page 21, Column 1; Nati..."
4,20181123210703764,"{""section"":""Section 6; Page 74, Column 1; Maga..."
...,...,...
174524,20190901000258418,"{""publication type"":""PUBLICATION-TYPE: Newswir..."
174525,20190901000258420,"{""publication type"":""PUBLICATION-TYPE: Newswir..."
174526,20190901000258508,"{""publication type"":""PUBLICATION-TYPE: Newswir..."
174527,20190901000258582,"{""publication type"":""PUBLICATION-TYPE: Newswir..."


In [5]:
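import ast
metadata_df['other_metadata'] = metadata_df['other_metadata'].str.replace(r"\\\\", r"\\", regex=True)  # collapse doubled backslashes
# literal_eval parses Python literals only, so unlike eval it cannot execute arbitrary code
metadata_df['other_metadata'] = metadata_df['other_metadata'].apply(ast.literal_eval)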

In [6]:
metadata = metadata_df['other_metadata'].apply(pd.Series)

In [7]:
df_m = pd.concat([metadata_df, metadata], axis=1)
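
The parse-and-expand step can also be tried on a small toy frame. This is a minimal sketch, not the notebook's method: it assumes each "other_metadata" value is valid JSON (the real field needed backslash cleanup first, so this may not hold for every row), and the names `toy`, `parsed`, `expanded`, and `toy_wide` are illustrative only.

```python
import json

import pandas as pd

# Toy rows shaped like the "other_metadata" column (values are illustrative).
toy = pd.DataFrame({
    "aid": ["1", "2"],
    "other_metadata": [
        '{"section":"Section C; Page 3, Column 1; Science Desk","byline":"By WALTER SULLIVAN"}',
        '{"publication type":"PUBLICATION-TYPE: Newswire","section":"DOMESTIC NEWS"}',
    ],
})

# json.loads only parses data; it never evaluates code.
parsed = toy["other_metadata"].apply(json.loads)

# One column per key; a key missing from a row becomes NaN.
expanded = pd.json_normalize(parsed.tolist())

# Both frames share the default 0..n-1 index, so a column-wise concat lines up row by row.
toy_wide = pd.concat([toy, expanded], axis=1)
print(toy_wide)
```

On frames with many rows, `pd.json_normalize` on a list of dicts is usually much faster than `.apply(pd.Series)`, which builds one Series per row.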

In [8]:
article_df

Unnamed: 0,title,publisher,publication_date,aid,url,insect_population,insect_decline,pollinator_population,pollinator_decline,insect_apocalypse,colony_collapse,climate_change,climate_change_IPCCreport,insect_population_studies
0,Forty dead in flooding in eastern Turkey,AFP,1991-05-17,20190301235543664,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
1,Czechoslovakia lighthouse of reform: minister,AFP,1991-05-31,20190301235551744,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
2,"Oil consumers to boost stocks, better relation...",AFP,1991-06-03,20190302000102692,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
3,Rising seas pose major threat to Pacific islands,AFP,1991-06-09,20190301235841326,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
4,Indonesian group criticizes U.S. over global w...,AFP,1991-06-11,20190301235840927,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174336,Ex-Ethiopian PM urges Africa to embrace tech t...,XGNS,2019-08-16,20190830211535140,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
174337,"Xinhua Asia-Pacific news summary at 1600 GMT, ...",XGNS,2019-08-17,20190831000249345,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
174338,"1st LD: China, France should work together to ...",XGNS,2019-08-18,20190901000258420,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0
174339,Feature: Italy agricultural sector facing setb...,XGNS,2019-08-18,20190901000258414,https://advance.lexis.com/api/document?collect...,0,0,0,0,0,0,1,0,0


In [9]:
df = df_m.merge(article_df[['aid', 'publisher', 'title', 'publication_date', 'url']], on='aid', how='right')
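
A right merge keeps every row of the article table, including articles with no metadata record. One way to see how many rows matched is the `indicator` option of `merge`. This is a sketch on toy frames, with `left`, `right`, and `merged` as illustrative names rather than the notebook's variables.

```python
import pandas as pd

# Toy stand-ins for the metadata table (left) and the article table (right).
left = pd.DataFrame({"aid": ["1", "2"], "section": ["Section A", "OPINION"]})
right = pd.DataFrame({"aid": ["2", "3"], "publisher": ["NYT", "AFP"]})

# indicator=True adds a "_merge" column recording where each row came from.
merged = left.merge(right, on="aid", how="right", indicator=True)

# "both" = matched; "right_only" = article without a metadata record.
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `right_only` would carry NaN in every metadata column, so a nonzero count there is worth checking before extracting sections.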

In [10]:
df

Unnamed: 0,aid,other_metadata,section,byline,publication type,publisher,title,publication_date,url
0,20181123210429337,"{'section': 'Section C; Page 3, Column 1; Scie...","Section C; Page 3, Column 1; Science Desk",By WALTER SULLIVAN,,NYT,TWO NEW THEORIES OFFERED ON MASS EXTINCTIONS I...,1980-06-10,https://advance.lexis.com/api/document?collect...
1,20181123210429804,"{'section': 'Section 4; Page 18, Column 1; Edi...","Section 4; Page 18, Column 1; Editorial Desk",,,NYT,The Catch in Coal,1980-06-08,https://advance.lexis.com/api/document?collect...
2,20181123210429807,"{'section': 'Section 4; Page 18, Column 4; Edi...","Section 4; Page 18, Column 4; Editorial Desk",,,NYT,COAL MUST NOT BECOME KING,1980-06-08,https://advance.lexis.com/api/document?collect...
3,20181123210442721,"{'section': 'Section A; Page 21, Column 1; Nat...","Section A; Page 21, Column 1; National Desk",,,NYT,POLLLUTION'S IMPACT ON CHILDREN FEARED:U.N. AG...,1980-06-05,https://advance.lexis.com/api/document?collect...
4,20181123210703764,"{'section': 'Section 6; Page 74, Column 1; Mag...","Section 6; Page 74, Column 1; Magazine Desk",By Jane Ogle,,NYT,BEAUTY SWEET SCENTS TO REPEL PESTS,1980-06-01,https://advance.lexis.com/api/document?collect...
...,...,...,...,...,...,...,...,...,...
174336,20190901000258418,{'publication type': 'PUBLICATION-TYPE: Newswi...,INTERNATIONAL NEWS,刘芳,PUBLICATION-TYPE: Newswire,XGNS,"2nd LD Writethru: China, France should work to...",2019-08-18,https://advance.lexis.com/api/document?collect...
174337,20190901000258420,{'publication type': 'PUBLICATION-TYPE: Newswi...,INTERNATIONAL NEWS,刘芳,PUBLICATION-TYPE: Newswire,XGNS,"1st LD: China, France should work together to ...",2019-08-18,https://advance.lexis.com/api/document?collect...
174338,20190901000258508,{'publication type': 'PUBLICATION-TYPE: Newswi...,INTERNATIONAL NEWS,"By SETH BORENSTEIN, AP Science Writer",PUBLICATION-TYPE: Newswire,AP,Funeral for lost ice: Iceland bids farewell to...,2019-08-18,https://advance.lexis.com/api/document?collect...
174339,20190901000258582,{'publication type': 'PUBLICATION-TYPE: Newswi...,DOMESTIC NEWS,"By DAN JOLING, Associated Press",PUBLICATION-TYPE: Newswire,AP,"Blooms, beasts affected as Alaska records hott...",2019-08-18,https://advance.lexis.com/api/document?collect...


In [11]:
df_nyt = df.query("publisher=='NYT'")

In [12]:
df_nyt

Unnamed: 0,aid,other_metadata,section,byline,publication type,publisher,title,publication_date,url
0,20181123210429337,"{'section': 'Section C; Page 3, Column 1; Scie...","Section C; Page 3, Column 1; Science Desk",By WALTER SULLIVAN,,NYT,TWO NEW THEORIES OFFERED ON MASS EXTINCTIONS I...,1980-06-10,https://advance.lexis.com/api/document?collect...
1,20181123210429804,"{'section': 'Section 4; Page 18, Column 1; Edi...","Section 4; Page 18, Column 1; Editorial Desk",,,NYT,The Catch in Coal,1980-06-08,https://advance.lexis.com/api/document?collect...
2,20181123210429807,"{'section': 'Section 4; Page 18, Column 4; Edi...","Section 4; Page 18, Column 4; Editorial Desk",,,NYT,COAL MUST NOT BECOME KING,1980-06-08,https://advance.lexis.com/api/document?collect...
3,20181123210442721,"{'section': 'Section A; Page 21, Column 1; Nat...","Section A; Page 21, Column 1; National Desk",,,NYT,POLLLUTION'S IMPACT ON CHILDREN FEARED:U.N. AG...,1980-06-05,https://advance.lexis.com/api/document?collect...
4,20181123210703764,"{'section': 'Section 6; Page 74, Column 1; Mag...","Section 6; Page 74, Column 1; Magazine Desk",By Jane Ogle,,NYT,BEAUTY SWEET SCENTS TO REPEL PESTS,1980-06-01,https://advance.lexis.com/api/document?collect...
...,...,...,...,...,...,...,...,...,...
174305,20190901000257301,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section SR; Column 0; Editorial Desk; LETTERS;...,,PUBLICATION-TYPE: Newspaper,NYT,Probing the Psyches of Mass Killers,2019-08-18,https://advance.lexis.com/api/document?collect...
174306,20190901000257318,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section A; Column 0; Foreign Desk; Pg. 12,By RAPHAEL MINDER,PUBLICATION-TYPE: Newspaper,NYT,"Smokey Bear, Meet the Hungry Goats of Portugal...",2019-08-18,https://advance.lexis.com/api/document?collect...
174307,20190901000257345,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section SR; Column 0; Sunday Review Desk; Pg. 2,By KATRIN JAKOBSDOTTIR,PUBLICATION-TYPE: Newspaper,NYT,An Ice-Free Iceland Is Not a Joke,2019-08-18,https://advance.lexis.com/api/document?collect...
174308,20190901000257461,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section M2; Column 0; T: Women's Fashion Magaz...,By MICHAEL SNYDER,PUBLICATION-TYPE: Newspaper,NYT,Concrete Jungle,2019-08-18,https://advance.lexis.com/api/document?collect...


In [13]:
nyt_sections = df_nyt['section'].str.split(";", expand=True)

In [14]:
nyt_sections = nyt_sections.add_prefix("section")

In [15]:
nyt_sections

Unnamed: 0,section0,section1,section2,section3,section4,section5,section6
0,Section C,"Page 3, Column 1",Science Desk,,,,
1,Section 4,"Page 18, Column 1",Editorial Desk,,,,
2,Section 4,"Page 18, Column 4",Editorial Desk,,,,
3,Section A,"Page 21, Column 1",National Desk,,,,
4,Section 6,"Page 74, Column 1",Magazine Desk,,,,
...,...,...,...,...,...,...,...
174305,Section SR,Column 0,Editorial Desk,LETTERS,Pg. 10,,
174306,Section A,Column 0,Foreign Desk,Pg. 12,,,
174307,Section SR,Column 0,Sunday Review Desk,Pg. 2,,,
174308,Section M2,Column 0,T: Women's Fashion Magazine,Pg. 190,,,


In [16]:
df_nyt_all = pd.concat([df_nyt, nyt_sections], axis = 1)

In [17]:
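# The character class includes whitespace, so the capture starts at the space after ";": strip it
df_nyt_all['desk_name'] = df_nyt_all['section'].str.extract(pat = r"([A-Za-z\s]+)\sDesk", expand=False).str.strip()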

In [18]:
df_nyt_all['page'] = df_nyt_all['section'].str.extract(pat = r"Page\s([0-9]+)")

In [19]:
df_nyt_all['pg'] = df_nyt_all['section'].str.extract(pat = r"Pg\.\s([0-9]+)")

In [20]:
df_nyt_all['page_number'] = df_nyt_all['page'].fillna(df_nyt_all['pg'])

In [21]:
df_nyt_all['section_name'] = df_nyt_all['section'].str.extract(pat = r"Section\s([A-Z0-9]+)")
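
The extraction patterns in the cells above can be checked against a few section strings. This is a minimal sketch: the sample strings are reconstructed from rows displayed in this notebook, and `samples` and `result` are illustrative names. It covers both page formats ("Page 3" in older records, "Pg. 2" in newer ones) and a section with no "Desk" label.

```python
import pandas as pd

samples = pd.Series([
    "Section C; Page 3, Column 1; Science Desk",
    "Section SR; Column 0; Sunday Review Desk; Pg. 2",
    "Section M2; Column 0; T: Women's Fashion Magazine; Pg. 190",
])

# expand=False returns a Series (one capture group) instead of a one-column DataFrame.
desk = samples.str.extract(r"([A-Za-z\s]+)\sDesk", expand=False)
page = samples.str.extract(r"Page\s([0-9]+)", expand=False)
pg = samples.str.extract(r"Pg\.\s([0-9]+)", expand=False)
section_name = samples.str.extract(r"Section\s([A-Z0-9]+)", expand=False)

result = pd.DataFrame({
    # \s is inside the character class, so the raw capture begins with the
    # space that follows the semicolon; strip() removes it.
    "desk_name": desk.str.strip(),
    # Coalesce the two page formats: use "Page N" when present, else "Pg. N".
    "page_number": page.fillna(pg),
    "section_name": section_name,
})
print(result)
```

The third sample has no "Desk" suffix, so its `desk_name` is missing. The extracted page numbers are strings, so they would need a numeric conversion before any arithmetic or sorting.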

In [22]:
# Flag the publication type; casting a boolean mask to 0/1 avoids chained assignment
df_nyt_all['type_newspaper'] = (df_nyt_all['publication type'] == 'PUBLICATION-TYPE: Newspaper').astype(int)
df_nyt_all['type_web'] = (df_nyt_all['publication type'] == 'PUBLICATION-TYPE: Web Blog').astype(int)


In [23]:
df_nyt_all

Unnamed: 0,aid,other_metadata,section,byline,publication type,publisher,title,publication_date,url,section0,...,section4,section5,section6,desk_name,page,pg,page_number,section_name,type_newspaper,type_web
0,20181123210429337,"{'section': 'Section C; Page 3, Column 1; Scie...","Section C; Page 3, Column 1; Science Desk",By WALTER SULLIVAN,,NYT,TWO NEW THEORIES OFFERED ON MASS EXTINCTIONS I...,1980-06-10,https://advance.lexis.com/api/document?collect...,Section C,...,,,,Science,3,,3,C,0,0
1,20181123210429804,"{'section': 'Section 4; Page 18, Column 1; Edi...","Section 4; Page 18, Column 1; Editorial Desk",,,NYT,The Catch in Coal,1980-06-08,https://advance.lexis.com/api/document?collect...,Section 4,...,,,,Editorial,18,,18,4,0,0
2,20181123210429807,"{'section': 'Section 4; Page 18, Column 4; Edi...","Section 4; Page 18, Column 4; Editorial Desk",,,NYT,COAL MUST NOT BECOME KING,1980-06-08,https://advance.lexis.com/api/document?collect...,Section 4,...,,,,Editorial,18,,18,4,0,0
3,20181123210442721,"{'section': 'Section A; Page 21, Column 1; Nat...","Section A; Page 21, Column 1; National Desk",,,NYT,POLLLUTION'S IMPACT ON CHILDREN FEARED:U.N. AG...,1980-06-05,https://advance.lexis.com/api/document?collect...,Section A,...,,,,National,21,,21,A,0,0
4,20181123210703764,"{'section': 'Section 6; Page 74, Column 1; Mag...","Section 6; Page 74, Column 1; Magazine Desk",By Jane Ogle,,NYT,BEAUTY SWEET SCENTS TO REPEL PESTS,1980-06-01,https://advance.lexis.com/api/document?collect...,Section 6,...,,,,Magazine,74,,74,6,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174305,20190901000257301,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section SR; Column 0; Editorial Desk; LETTERS;...,,PUBLICATION-TYPE: Newspaper,NYT,Probing the Psyches of Mass Killers,2019-08-18,https://advance.lexis.com/api/document?collect...,Section SR,...,Pg. 10,,,Editorial,,10,10,SR,1,0
174306,20190901000257318,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section A; Column 0; Foreign Desk; Pg. 12,By RAPHAEL MINDER,PUBLICATION-TYPE: Newspaper,NYT,"Smokey Bear, Meet the Hungry Goats of Portugal...",2019-08-18,https://advance.lexis.com/api/document?collect...,Section A,...,,,,Foreign,,12,12,A,1,0
174307,20190901000257345,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section SR; Column 0; Sunday Review Desk; Pg. 2,By KATRIN JAKOBSDOTTIR,PUBLICATION-TYPE: Newspaper,NYT,An Ice-Free Iceland Is Not a Joke,2019-08-18,https://advance.lexis.com/api/document?collect...,Section SR,...,,,,Sunday Review,,2,2,SR,1,0
174308,20190901000257461,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section M2; Column 0; T: Women's Fashion Magaz...,By MICHAEL SNYDER,PUBLICATION-TYPE: Newspaper,NYT,Concrete Jungle,2019-08-18,https://advance.lexis.com/api/document?collect...,Section M2,...,,,,,,190,190,M2,1,0


In [24]:
df_nyt_all = df_nyt_all.drop(columns = ['pg', 'page'])

In [25]:
df_nyt_all.to_csv("../Data/Metadata/nyt_metadata.csv", index = False)

In [27]:
df_nyt_all['section0'].value_counts().head(20)

Section A     9994
OPINION       2465
Section C     2083
SCIENCE       1722
Section       1524
Section B     1401
US            1154
Section D     1109
Section F      864
WORLD          789
BRIEFING       630
Section 1      621
CLIMATE        576
Section SR     561
BUSINESS       518
Section BR     451
Section 4      429
Section MM     351
Section E      346
Section 7      334
Name: section0, dtype: int64

In [29]:
df_nyt_q_m = df_nyt_all.merge(article_df, on=['aid', 'title', 'publisher', 'publication_date', 'url'], how='left')

In [30]:
df_nyt_q_m

Unnamed: 0,aid,other_metadata,section,byline,publication type,publisher,title,publication_date,url,section0,...,type_web,insect_population,insect_decline,pollinator_population,pollinator_decline,insect_apocalypse,colony_collapse,climate_change,climate_change_IPCCreport,insect_population_studies
0,20181123210429337,"{'section': 'Section C; Page 3, Column 1; Scie...","Section C; Page 3, Column 1; Science Desk",By WALTER SULLIVAN,,NYT,TWO NEW THEORIES OFFERED ON MASS EXTINCTIONS I...,1980-06-10,https://advance.lexis.com/api/document?collect...,Section C,...,0,0,0,0,0,0,0,1,0,0
1,20181123210429804,"{'section': 'Section 4; Page 18, Column 1; Edi...","Section 4; Page 18, Column 1; Editorial Desk",,,NYT,The Catch in Coal,1980-06-08,https://advance.lexis.com/api/document?collect...,Section 4,...,0,0,0,0,0,0,0,1,0,0
2,20181123210429807,"{'section': 'Section 4; Page 18, Column 4; Edi...","Section 4; Page 18, Column 4; Editorial Desk",,,NYT,COAL MUST NOT BECOME KING,1980-06-08,https://advance.lexis.com/api/document?collect...,Section 4,...,0,0,0,0,0,0,0,1,0,0
3,20181123210442721,"{'section': 'Section A; Page 21, Column 1; Nat...","Section A; Page 21, Column 1; National Desk",,,NYT,POLLLUTION'S IMPACT ON CHILDREN FEARED:U.N. AG...,1980-06-05,https://advance.lexis.com/api/document?collect...,Section A,...,0,0,0,0,0,0,0,1,0,0
4,20181123210703764,"{'section': 'Section 6; Page 74, Column 1; Mag...","Section 6; Page 74, Column 1; Magazine Desk",By Jane Ogle,,NYT,BEAUTY SWEET SCENTS TO REPEL PESTS,1980-06-01,https://advance.lexis.com/api/document?collect...,Section 6,...,0,1,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32542,20190901000257301,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section SR; Column 0; Editorial Desk; LETTERS;...,,PUBLICATION-TYPE: Newspaper,NYT,Probing the Psyches of Mass Killers,2019-08-18,https://advance.lexis.com/api/document?collect...,Section SR,...,0,0,0,0,0,0,0,1,0,0
32543,20190901000257318,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section A; Column 0; Foreign Desk; Pg. 12,By RAPHAEL MINDER,PUBLICATION-TYPE: Newspaper,NYT,"Smokey Bear, Meet the Hungry Goats of Portugal...",2019-08-18,https://advance.lexis.com/api/document?collect...,Section A,...,0,0,0,0,0,0,0,1,0,0
32544,20190901000257345,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section SR; Column 0; Sunday Review Desk; Pg. 2,By KATRIN JAKOBSDOTTIR,PUBLICATION-TYPE: Newspaper,NYT,An Ice-Free Iceland Is Not a Joke,2019-08-18,https://advance.lexis.com/api/document?collect...,Section SR,...,0,0,0,0,0,0,0,1,1,0
32545,20190901000257461,{'publication type': 'PUBLICATION-TYPE: Newspa...,Section M2; Column 0; T: Women's Fashion Magaz...,By MICHAEL SNYDER,PUBLICATION-TYPE: Newspaper,NYT,Concrete Jungle,2019-08-18,https://advance.lexis.com/api/document?collect...,Section M2,...,0,0,0,0,0,0,0,1,0,0


In [31]:
df_nyt_q_m.to_csv("../Data/Metadata/nyt_articles_withmetadata.csv", index=False)
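
With the query flags and section fields in one table, the comparison described in the Purpose section could start from a per-desk tally. This is a hedged sketch on toy data, not the analysis actually run on these files: the frame `toy`, the two chosen query columns, and the "(no desk)" label are assumptions for illustration.

```python
import pandas as pd

# Toy rows: one desk label plus 0/1 flags for two of the query columns.
toy = pd.DataFrame({
    "desk_name": ["Science", "Science", "Editorial", None],
    "insect_decline": [1, 0, 0, 1],
    "climate_change": [1, 1, 1, 0],
})
query_cols = ["insect_decline", "climate_change"]

# Give articles without a desk their own label so groupby does not drop them.
toy["desk_name"] = toy["desk_name"].fillna("(no desk)")

# Summing 0/1 flags counts the articles per desk for each query.
counts = toy.groupby("desk_name")[query_cols].sum()

# Dividing by the column totals gives each desk's share of a query's articles.
shares = counts / counts.sum()
print(counts)
print(shares)
```

Shares make queries of very different sizes comparable, since a large query would otherwise dominate the raw counts in every desk.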