This is a very important notebook. 

In this notebook, I explain how I convert the `csv` files into the `json` files used by the API.

## Expand paper.json

Let's first add `authorships` and `author_names`. 

In [155]:
import pandas as pd 
import numpy as np 
from json import loads, dumps
from collections import OrderedDict

In [156]:
papers = pd.read_csv('../data/processed/papers.csv')
papers.columns = ['paper_id', 'title', 'paper_type', 'abstract', 
                  'number_of_authors', 'year', 'session', 'division', 'authors']
papers.drop(['authors'], inplace=True, axis = 1)
papers.head()

Unnamed: 0,paper_id,title,paper_type,abstract,number_of_authors,year,session,division
0,2003-0001,Access to the Media Versus Access to Audiences...,Paper,When the issue of speakers' rights of access a...,1.0,2003,,
1,2003-0002,Accounting Episodes as Communicative Practice ...,Paper,In this paper I describe accounting episodes a...,1.0,2003,,
2,2003-0003,Accounts of Single-fatherhood: A case study,Paper,Abstract\nRelying on single-fathers accounts ...,4.0,2003,,
3,2003-0004,A Challenge to the Duel: Socializing Dedicated...,Paper,This paper explores the structural controls av...,1.0,2003,,
4,2003-0005,A chatroom ethnography: Evolution of community...,Paper,"In creating an ethnography about the City, Tex...",1.0,2003,,


In [157]:
paper_ids = papers.paper_id.unique()

In [158]:
papers_json = loads(papers.to_json(orient='records'))
papers_json[0]

{'paper_id': '2003-0001',
 'title': 'Access to the Media Versus Access to Audiences: The Distinction and its Implications for Media Regulation and Policy',
 'paper_type': 'Paper',
 'abstract': 'When the issue of speakers\' rights of access arises in media regulation and policy contexts, the focus typically is on the concept of speakers\' rights of access "to the media," or "to the press." This right typically is premised on the audience\'s need for access to diverse sources and content. In contrast, in many non-mediated contexts, the concept of speakers\' rights of access frequently is defined in terms of the speaker\'s own First Amendment right of access to audiences. This paper explores the distinctions between these differing interpretations of a speaker\'s access rights and argues that the concept of a speaker\'s right of access to audiences merits a more prominent position in media regulation and policy. This paper then explores the implications of such a shift in perspective for 

In [159]:
authors = pd.read_csv('../data/processed/authors.csv')
authors.head()

Unnamed: 0,Paper ID,Title,Number of Authors,Author Position,Author Name,Author Affiliation,Year
0,2003-0001,Access to the Media Versus Access to Audiences...,1,1,Philip Napoli,Fordham U,2003
1,2003-0002,Accounting Episodes as Communicative Practice ...,1,1,Mariko Kotani,Aoyama Gakuin University,2003
2,2003-0003,Accounts of Single-fatherhood: A case study,4,1,Tara M Emmers-Sommer,University of Arizona,2003
3,2003-0003,Accounts of Single-fatherhood: A case study,4,2,David Rhea,University of Arizona,2003
4,2003-0003,Accounts of Single-fatherhood: A case study,4,3,Laura Triplett,University of Arizona,2003


In [190]:
authorships_dict = {}      # paper_id -> list of authorship dicts
paperid_authors_dic = {}   # paper_id -> list of author names
# for every paper
for paper_id, group in authors.groupby('Paper ID'):
    author_names = group['Author Name'].tolist()
    affs = group['Author Affiliation'].tolist()
    paperid_authors_dic[paper_id] = list(author_names)
    authorships = []
    # note: 'position' is 0-based and follows the row order in authors.csv,
    # whereas the 'Author Position' column is 1-based
    for i, author_name in enumerate(author_names):
        dic = {}
        dic['position'] = i
        dic['author_name'] = author_name
        dic['author_affiliation'] = affs[i]
        authorships.append(dic)
    authorships_dict[paper_id] = authorships

In [191]:
authorships_dict[paper_ids[-1]]

[{'position': 0, 'author_name': 'Åsa Kroon', 'author_affiliation': 'Örebro U'},
 {'position': 1,
  'author_name': 'Mats Erik Ekstrom',
  'author_affiliation': 'Orebro U'}]
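
The loop above numbers authors from 0 in row order, while the `Author Position` column in `authors.csv` starts at 1. Below is a minimal sketch of carrying the recorded position through instead, assuming that column is reliable. The `rows` list is a hand-made stand-in for `authors.to_dict('records')`, and `build_authorships` is a hypothetical helper that is not part of this notebook.

```python
from collections import defaultdict

# Stand-in for authors.to_dict('records'); rows are deliberately out of order.
rows = [
    {'Paper ID': '2003-0003', 'Author Position': 2,
     'Author Name': 'David Rhea', 'Author Affiliation': 'University of Arizona'},
    {'Paper ID': '2003-0003', 'Author Position': 1,
     'Author Name': 'Tara M Emmers-Sommer', 'Author Affiliation': 'University of Arizona'},
    {'Paper ID': '2003-0001', 'Author Position': 1,
     'Author Name': 'Philip Napoli', 'Author Affiliation': 'Fordham U'},
]


def build_authorships(rows):
    """Group rows by paper and order each group by the recorded position."""
    by_paper = defaultdict(list)
    ordered = sorted(rows, key=lambda r: (r['Paper ID'], r['Author Position']))
    for row in ordered:
        by_paper[row['Paper ID']].append({
            'position': row['Author Position'],  # 1-based, as in the csv
            'author_name': row['Author Name'],
            'author_affiliation': row['Author Affiliation'],
        })
    return dict(by_paper)


authorships_by_paper = build_authorships(rows)
print(authorships_by_paper['2003-0003'][0]['author_name'])
```

Whether the API should expose 0-based or 1-based positions is a design choice; the sketch only shows how to keep the value that is already in the data.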

In [162]:
for paper_dic in papers_json:
    try:
        paper_dic['authorships'] = authorships_dict[paper_dic['paper_id']]
        paper_dic['author_names'] = paperid_authors_dic[paper_dic['paper_id']]
    except KeyError:
        # print papers that have no matching rows in authors.csv
        print(paper_dic)

{'paper_id': '2005-0806', 'title': 'Interpersonal and Intrapersonal Motives to Acquire Information from Mediated Messages', 'paper_type': 'Paper', 'abstract': 'The present investigation explores the influences of interpersonal (intrinsic) and intrapersonal (extrinsic) motives on information acquisition from mediated messages, as well as the influences these motives may have on each other. Intrinsic and extrinsic motives were operationalized as personal interest and expectations of future relevant discussion, respectively. Respondents received a manipulation that elevated the expectation of discussing certain topics with unknown students and then viewed a newscast featuring these topics. Personal interest in and information acquisition of each message were assessed, along with anticipations of topical discussion with friends or family. Results showed that intrinsic and extrinsic interests related positively to information acquisition indicators for the relevant news stories. In addition

In [163]:
papers_json[-1]

{'paper_id': '2018-0255',
 'title': 'The Impact of Presenting Physiological Data During Sporting Events on Audiences Entertainment',
 'paper_type': 'Poster',
 'abstract': 'Psychophysiological data has been useful in many domains and this study examines the use of such information in the domain of sports audiences. This study employs a four condition experiment in which participants watched a short sports clip displaying different physiological measures in the corner. The participants were then asked about their perceptions of the clip. Broadly, there was not much difference between groups based on the types of information presented, however, presenting blood pressure information proved to be the most entertaining for audiences. This provides early evidence that the presentation of physiological information during a sporting event can impact feelings of enjoyment, meaningfulness, and perceptions of knowledge of the sport. There is promise for these measures to be used in sports media pr

Now, let's add `session_info`:

In [164]:
sessions = pd.read_csv('../data/processed/sessions.csv')
sessions.head()

Unnamed: 0,Year,Session Type,Session Title,Division/Unit,Chair Name,Chair Affiliation
0,2014,Paper Session,Meda Coverage of Health Issues,Health Communication,Xiaoli Nan,U of Maryland
1,2014,Paper Session,Cognition and Health,Health Communication,Seth M. Noar,U of North Carolina
2,2014,Paper Session,Changing the News 140 Characters At a Time: Tw...,Journalism Studies,Seth C. Lewis,U of Minnesota
3,2014,Paper Session,Media and Political Contestation in Greater China,Global Communication and Social Change,Guobin Yang,University of Pennsylvania
4,2014,Paper Session,Between Science and the Public: Studies in Sci...,Journalism Studies,Henrik Ornebring,Karlstad U


In [165]:
session_dic = {}
for session, group in sessions.groupby('Session Title'):
    dic = {}
    dic['session'] = session
    dic['session_type'] = group['Session Type'].tolist()[0]
    dic['chair_name'] = group['Chair Name'].tolist()[0]
    dic['chair_affiliation'] = group['Chair Affiliation'].tolist()[0]
    dic['division'] = group['Division/Unit'].tolist()[0]
    session_dic[session] = dic 

In [166]:
session_dic['Sports Communication Interactive Poster Session']

{'session': 'Sports Communication Interactive Poster Session',
 'session_type': 'Interactive Paper Session',
 'chair_name': nan,
 'chair_affiliation': nan,
 'division': 'In Event: ICA Plenary Interactive Paper/Poster Session II'}

In [167]:
for paper_dic in papers_json:
    try:
        paper_dic['session_info'] = session_dic[paper_dic['session']]
    except KeyError:
        # this is because even if a paper has a session,
        # that session might not be included in sessions.csv
        paper_dic['session_info'] = np.nan
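
The `try`/`except` above can also be written as a dictionary lookup with a default. Below is a small sketch, assuming `None` is an acceptable placeholder: it serializes to JSON `null`, whereas `np.nan` is written by `json.dumps` as the non-standard token `NaN`. The names `session_lookup` and `example_papers` and their contents are made up for illustration.

```python
# Made-up stand-ins for session_dic and papers_json.
session_lookup = {
    'Cognition and Health': {'session': 'Cognition and Health',
                             'division': 'Health Communication'},
}
example_papers = [
    {'paper_id': 'A', 'session': 'Cognition and Health'},
    {'paper_id': 'B', 'session': 'A session missing from the lookup'},
    {'paper_id': 'C', 'session': None},  # paper without any session
]

for paper in example_papers:
    # dict.get returns None when the key is absent instead of raising KeyError
    paper['session_info'] = session_lookup.get(paper['session'])
```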

In [168]:
papers_json[-1]

{'paper_id': '2018-0255',
 'title': 'The Impact of Presenting Physiological Data During Sporting Events on Audiences Entertainment',
 'paper_type': 'Poster',
 'abstract': 'Psychophysiological data has been useful in many domains and this study examines the use of such information in the domain of sports audiences. This study employs a four condition experiment in which participants watched a short sports clip displaying different physiological measures in the corner. The participants were then asked about their perceptions of the clip. Broadly, there was not much difference between groups based on the types of information presented, however, presenting blood pressure information proved to be the most entertaining for audiences. This provides early evidence that the presentation of physiological information during a sporting event can impact feelings of enjoyment, meaningfulness, and perceptions of knowledge of the sport. There is promise for these measures to be used in sports media pr

The record above shows what one entry of `papers.json` looks like.

In [169]:
papers_json

[{'paper_id': '2003-0001',
  'title': 'Access to the Media Versus Access to Audiences: The Distinction and its Implications for Media Regulation and Policy',
  'paper_type': 'Paper',
  'abstract': 'When the issue of speakers\' rights of access arises in media regulation and policy contexts, the focus typically is on the concept of speakers\' rights of access "to the media," or "to the press." This right typically is premised on the audience\'s need for access to diverse sources and content. In contrast, in many non-mediated contexts, the concept of speakers\' rights of access frequently is defined in terms of the speaker\'s own First Amendment right of access to audiences. This paper explores the distinctions between these differing interpretations of a speaker\'s access rights and argues that the concept of a speaker\'s right of access to audiences merits a more prominent position in media regulation and policy. This paper then explores the implications of such a shift in perspective 

In [170]:
# sorted_papers_json = sorted(papers_json, key=lambda x: int(x['paper_id'].replace('-', '')))

## Aggregated session data

For each session, we want the following fields:

- `division`
- `chair_name`
- `chair_affiliation`
- `session_type`
- `paper_count`
- `years`

In [171]:
papers.head()

Unnamed: 0,paper_id,title,paper_type,abstract,number_of_authors,year,session,division
0,2003-0001,Access to the Media Versus Access to Audiences...,Paper,When the issue of speakers' rights of access a...,1.0,2003,,
1,2003-0002,Accounting Episodes as Communicative Practice ...,Paper,In this paper I describe accounting episodes a...,1.0,2003,,
2,2003-0003,Accounts of Single-fatherhood: A case study,Paper,Abstract\nRelying on single-fathers accounts ...,4.0,2003,,
3,2003-0004,A Challenge to the Duel: Socializing Dedicated...,Paper,This paper explores the structural controls av...,1.0,2003,,
4,2003-0005,A chatroom ethnography: Evolution of community...,Paper,"In creating an ethnography about the City, Tex...",1.0,2003,,


In [172]:
for session, group in papers.groupby('session'):
    # groupby excludes rows with nan values
    # .tolist() yields plain Python ints, which the json module can serialize
    if session in session_dic:
        session_dic[session]['years'] = group.year.unique().tolist()
        session_dic[session]['paper_count'] = len(group)
    else:
        # session appears in papers.csv but not in sessions.csv
        dic = {}
        dic['session'] = session
        dic['years'] = group.year.unique().tolist()
        dic['paper_count'] = len(group)
        dic['session_type'] = np.nan 
        dic['chair_name'] = np.nan 
        dic['chair_affiliation'] = np.nan
        try:
            dic['division'] = group.division.unique()[0]
        except IndexError:
            dic['division'] = np.nan
        session_dic[session] = dic

In [173]:
sessions_json = list(session_dic.values())

In [174]:
sessions_json[-1]

{'session': '“I’m Ready for My Close-Up”: Representations of Women and Gender on Reality Television (Panel Session)',
 'years': [2013],
 'paper_count': 4,
 'session_type': nan,
 'chair_name': nan,
 'chair_affiliation': nan,
 'division': 'Mass Communication'}
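
Records like the one above still hold `nan` values. Python's `json.dumps` writes these as the bare token `NaN`, which is not valid JSON and which strict parsers reject. Below is a minimal sketch of replacing them with `None` before serializing. `nan_to_none` is a hypothetical helper, and `record` is a shortened, made-up version of the output above.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None in dicts and lists."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(item) for item in value]
    return value


record = {
    'session': 'Example Session',
    'years': [2013],
    'paper_count': 4,
    'chair_name': float('nan'),
    'division': 'Mass Communication',
}

# allow_nan=False makes json.dumps raise ValueError if a NaN slipped through
clean_json = json.dumps(nan_to_none(record), allow_nan=False, ensure_ascii=False)
print(clean_json)
```

NumPy integers are a separate concern. The `json` module cannot serialize them, so they need converting first, for example with `.tolist()`.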

## Aggregated author data

In [179]:
authors_json = []
for author_name, group in authors.groupby('Author Name'):
    # sort by year to make sure affs are in temporal order
    # (reassign instead of inplace=True to avoid modifying a groupby slice)
    group = group.sort_values('Year', ascending=True)
    paper_ids = group['Paper ID'].unique().tolist()
    affs = group['Author Affiliation'].unique()
    # .tolist() yields plain Python ints, which the json module can serialize
    years = group['Year'].unique().tolist()
    dic = {
        'author_name': author_name,
        'attend_count': len(years),
        'paper_count': len(paper_ids),
        'paper_ids': paper_ids,
        'affiliation': " -> ".join(map(str, affs)),
        'years_attended': years,
    }
    authors_json.append(dic)

In [183]:
len(authors_json)

21038

In [180]:
authors_json[1001]

{'author_name': 'Andrew Kamau',
 'attend_count': 1,
 'paper_count': 1,
 'paper_ids': ['2017-0020'],
 'affiliation': 'Code for Africa',
 'years_attended': [2017]}

In [177]:
# pd.DataFrame(dicts[0:100]).sort_values(by='attend_count', ascending=False)

In [184]:
sorted_authors_json = sorted(authors_json, key=lambda x: x['attend_count'], reverse=True)

In [188]:
sorted_authors_json[-1]

{'author_name': 'Åsa Kroon',
 'attend_count': 1,
 'paper_count': 1,
 'paper_ids': ['2007-1782'],
 'affiliation': 'Örebro U',
 'years_attended': [2007]}
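
Sorting on `attend_count` alone leaves tied authors in their original order, because Python's sort is stable. Below is a sketch of a two-level key, assuming ties should be broken by `paper_count`. The records in `example_authors` are made up.

```python
example_authors = [
    {'author_name': 'A', 'attend_count': 1, 'paper_count': 1},
    {'author_name': 'B', 'attend_count': 3, 'paper_count': 4},
    {'author_name': 'C', 'attend_count': 3, 'paper_count': 7},
]

# Tuples compare element by element: attend_count first, then paper_count.
ranked = sorted(
    example_authors,
    key=lambda a: (a['attend_count'], a['paper_count']),
    reverse=True,
)
print([a['author_name'] for a in ranked])  # ['C', 'B', 'A']
```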