# Introduction to Pandas

# Pandas key features

Fast and efficient DataFrame object with default and customized indexing.

Tools for loading data into in-memory data objects from different file formats.

Data alignment and integrated handling of missing data.

Reshaping and pivoting of data sets.

Label-based slicing, indexing and subsetting of large data sets.

Columns from a data structure can be deleted or inserted.

Group by data for aggregation and transformations.

High performance merging and joining of data.

Time Series functionality.

### INSTALL:

python -m pip install pandas



## Pandas & dataframes

We will go through Pandas via examples based on WOS data.

Pandas loads data into objects called DataFrames and Series.

Series: a pandas Series is a one-dimensional data structure (“a one-dimensional ndarray”) that stores values, and every value has an index label.

DataFrame: a pandas DataFrame is a two-dimensional data structure, basically a table with rows and columns. The columns have names and the rows have index labels.

Older pandas versions also had a Panel object (a container of DataFrames); it was deprecated and later removed, so it is not covered here.
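
As a minimal sketch of the two structures described above (the values and the names s and table are made up for illustration), a Series can be built from a list plus index labels, and a DataFrame from Series that share an index:

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])            # access a value by its label

# A DataFrame: a table whose columns are Series sharing one row index.
table = pd.DataFrame({'x': s, 'y': s * 2})
print(table.shape)       # (rows, columns)
print(type(table['x']))  # selecting one column gives back a Series
```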

The first example creates a DataFrame of random numbers whose index is a progressive number.

The second example uses a list of dates as the index.




In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),  columns=list('ABCD'))

df


In [None]:
dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates)

df


A DataFrame can also be created by passing a dictionary of objects.

In [None]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2

## Importing DFs from files

Pandas can read a CSV file with just one function: read_csv().

Quick functions to inspect a DataFrame:

data.head() prints the first five rows
data.tail(3) prints the last 3 rows
data.index shows the values and dtype of the index
data.columns shows an Index object with all column names
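
The inspection functions above can be tried without any file on disk. The sketch below assumes a tiny made-up CSV held in a string (the rows are hypothetical; only the column names echo the ones used later in this notebook) and reads it through io.StringIO:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """UID,pubyear,source
W1,1998,MANAGEMENT SCIENCE
W2,1999,ORGANIZATION STUDIES
W3,1999,MANAGEMENT SCIENCE
W4,2001,STRATEGIC MANAGEMENT JOURNAL
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))        # first 2 rows
print(sample.tail(1))        # last row
print(sample.index)          # the default index is a RangeIndex
print(list(sample.columns))  # column names as a plain list
```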

A more advanced tool is qgrid,
a widget that provides an interactive view of DataFrames.





In [None]:
import qgrid

data = pd.read_csv("wos_publications.csv")
# data.head()
qgrid_widget = qgrid.show_grid(data, show_toolbar=True)
qgrid_widget


In [None]:

# qgrid_widget.get_changed_df()


The next step is to see how to access data in a DataFrame. A column, or a list of columns, can be selected in three ways:

 dataframe.field_name
 
 dataframe['field_name']
 
 dataframe[['field1', ... 'fieldn']]  (the argument here is a list [l1, l2, ..., ln])

Adding a range [start:end] selects a subset of rows by position (the start row is included, the end row is excluded).
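
A small sketch of the three access forms and of the positional slice, on a hypothetical five-row DataFrame (the values are invented; only the column names mirror this notebook):

```python
import pandas as pd

# Hypothetical records used only for illustration.
sample = pd.DataFrame({
    'UID': ['W1', 'W2', 'W3', 'W4', 'W5'],
    'pubyear': [1998, 1999, 1999, 2001, 2003],
})

years_attr = sample.pubyear          # attribute access -> Series
years_key = sample['pubyear']        # key access -> the same Series
subset = sample[['UID', 'pubyear']]  # list of names -> DataFrame

# Positional slice: row 1 is included, row 3 is excluded.
print(subset[1:3])
```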


In [None]:
data.abstract[:10]

In [None]:
data[['UID', 'pubyear']][3:7]

#### isna() dropna()

Another useful method, .isna(), marks the fields that hold null values.

If we also add .sum() we get a quick statistic showing, for each field,
how many records have a null value.

If we need only the records where all fields contain a value, we can apply

dataset = dataset.dropna()
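
To make the effect of both calls concrete, here is a sketch on a hypothetical three-row DataFrame with a few missing values. Note that dropna() returns a new object and leaves the original untouched unless the result is assigned back:

```python
import numpy as np
import pandas as pd

# Hypothetical records with some missing values.
sample = pd.DataFrame({
    'UID': ['W1', 'W2', 'W3'],
    'abstract': ['text', np.nan, 'more text'],
    'pubyear': [1998, np.nan, np.nan],
})

null_counts = sample.isna().sum()  # number of nulls per column
complete = sample.dropna()         # keeps only rows with no nulls at all

print(null_counts)
print(complete)
```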

In [None]:
data.isna().sum()

## Data selection

loc, iloc, boolean indexing and isin

are the tools that give access to the data:

loc: access a group of rows and columns by label(s)
iloc: purely integer-location based indexing, for selection by position
df[df.column < value] (or any other condition) returns only the records where the condition is true
df[df.column.isin(['val1', 'val2', ..., 'valn'])] returns the records whose column value is in the list
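
The four access styles can be compared side by side. This sketch assumes a hypothetical DataFrame with string index labels, which makes the difference between label-based and position-based slicing visible:

```python
import pandas as pd

# Hypothetical records with string labels as index.
sample = pd.DataFrame(
    {'pubyear': [1998, 1999, 1999, 2001],
     'doc_type': ['Article', 'Review', 'Article', 'Article']},
    index=['w1', 'w2', 'w3', 'w4'],
)

by_label = sample.loc['w2':'w3', ['pubyear']]   # label slice: both ends included
by_position = sample.iloc[1:3, [0]]             # position slice: end excluded
before_1999 = sample[sample.pubyear < 1999]     # boolean mask
reviews = sample[sample['doc_type'].isin(['Review'])]

print(by_label)
print(before_1999)
print(reviews)
```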




In [None]:
data.loc[1:5, ['pubyear', 'doc_type']] # rows labelled 1 to 5 (both ends included): year and type of publication

In [None]:
data.iloc[1:6, [8, 5]] # same rows selected by position; assumes pubyear and doc_type are columns 8 and 5

In [None]:
data[data.pubyear<1999][:3]  # first 3 articles with year < 1999

In [None]:
# search for text in a column; na=False treats missing values as non-matches

data[data.source.str.contains('STUDIES', na=False)].head()


In [None]:
data[data['pubyear'].isin([1999])][:3]  # 3 records of 1999

We can also select the titles of articles published in management journals, dropping duplicates with the function

drop_duplicates()

In [None]:
journals = ['MANAGEMENT SCIENCE', 'STRATEGIC MANAGEMENT JOURNAL']
mgt = data[data['source'].isin(journals)]
mgt['itemtitle'].drop_duplicates()

### Some basic statistic functions:

As in many other languages, you can use max(), min(), mean(), median(), count() and sum()
in the form dataframe.column.function() (e.g. data.pubyear.min()).

describe() shows a quick statistical summary of your data.

groupby(column) creates a GroupBy object on which statistics are computed per group;
count() returns the number of non-null (non-NaN) values.

Other useful methods are

rank() Data ranking: assigns a rank to each element of a Series (or of each DataFrame column).

corr() Correlation: measures the linear relationship between two arrays of values.

cov() Covariance: the Series object has a method cov to compute the covariance between two Series.

pct_change() Compares every element with its prior element and computes the relative change as a fraction (0.5 means +50%).

rolling(window=n).func() Applies a given function, e.g. mean(), over a rolling window of n elements.
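
A short sketch of the methods listed above, run on a hypothetical Series of yearly article counts (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical yearly article counts.
counts = pd.Series([10, 20, 15, 30], index=[1998, 1999, 2000, 2001])

ranks = counts.rank()                     # 1 = smallest value
change = counts.pct_change()              # relative change vs. the previous element
smooth = counts.rolling(window=2).mean()  # mean over a 2-element window

# A second hypothetical Series on the same index, for corr() and cov().
other = pd.Series([1, 2, 3, 4], index=counts.index)
print(counts.corr(other))  # Pearson correlation
print(counts.cov(other))   # covariance

print(ranks)
print(change)
print(smooth)
```

The first element of pct_change() and of the rolling mean is NaN, because there is no prior element to compare with or to fill the window.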

In [None]:
data.describe()

In [None]:
data.groupby('pubyear').count()

The next step is to get the count for one field only.
To make things easier we store the GroupBy object in a variable
and show the count for the UID field.

In [None]:
grouped = data.groupby('pubyear')
print(grouped['UID'].count())

In [None]:
data['pubyear'].value_counts()

## Drawing

Using matplotlib, we now draw the trend of the number of articles per year.


In [None]:
import matplotlib.pyplot as plt
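# %matplotlib inline renders plots in the notebook without the namespace pollution of %pylab inline
%matplotlib inline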
grouped['UID'].count().plot()
plt.show()

## Sort and transpose

A DataFrame can be sorted by index or by values.

.T transposes it.

df.sort_index(axis=1, ascending=False, inplace=True)   [by default, i.e. without inplace=True, a sorted copy is returned]
sorts the object by labels along the given axis
axis: index (0) or column names (1), selects which labels are sorted


df.sort_values(by='column name', ascending=False)
sorts the whole dataset by the values of a column

Note: after sorting by values the index is no longer in order; the method .reset_index() rebuilds it.
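
The sorting calls above can be sketched on a hypothetical three-row DataFrame. Here reset_index(drop=True) is used so that the old index is discarded rather than kept as an extra column:

```python
import pandas as pd

# Hypothetical data; column B is deliberately unsorted.
sample = pd.DataFrame({'B': [3, 1, 2], 'A': ['x', 'y', 'z']})

by_columns = sample.sort_index(axis=1)                  # columns in label order: A, B
by_value = sample.sort_values(by='B', ascending=False)  # rows by descending B
renumbered = by_value.reset_index(drop=True)            # fresh 0..n-1 index

print(by_value.index.tolist())    # original labels, now out of order
print(renumbered.index.tolist())  # rebuilt index
```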



In [None]:
a=pd.DataFrame(grouped['UID'].count())
a.T

In [None]:
a.sort_values(by='UID', ascending=False)

### Updating Dataframes

Now we want to update the existing dataset.

We start by adding a new column ptype that contains a single letter instead of the full description of the kind of publication
held in column pubtype.

Then we want to set the values as follows:

Journal  = A
Book in series = B

We might do a 'case by case' update using a where condition like
data.loc[data['pubtype'] == 'Journal', 'ptype'] = 'A'




In [None]:
data['ptype']=data['pubtype']

# this case-by-case form works but should be avoided: use a dict instead
# data.loc[data['pubtype'] == 'Journal', 'ptype'] = 'A'
# data.loc[data['pubtype'] == 'Book in series', 'ptype'] = 'B'

d1 = {'Journal': 'A',
      'Book in series': 'B'}

data.replace({'ptype': d1}, inplace=True)  # inplace=True modifies data itself; without it a new DataFrame is returned

data.head(5)




Similar results could be obtained with the function

np.where()


for instance to set atype = Article if ptype is A, else Other


data['atype'] = np.where(data['ptype']=='A', 'Article', 'Other')
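
As a self-contained sketch of both recoding styles, the example below uses a hypothetical pubtype column (the value 'Book' is invented to show what happens to a value missing from the dictionary) and applies the dictionary through Series.replace() before calling np.where():

```python
import numpy as np
import pandas as pd

# Hypothetical publication types; 'Book' is not in the recoding dict.
sample = pd.DataFrame({'pubtype': ['Journal', 'Book in series', 'Journal', 'Book']})

# Dictionary-based recoding: values missing from the dict are left unchanged.
codes = {'Journal': 'A', 'Book in series': 'B'}
sample['ptype'] = sample['pubtype'].replace(codes)

# Two-way choice: 'Article' where the condition holds, 'Other' elsewhere.
sample['atype'] = np.where(sample['ptype'] == 'A', 'Article', 'Other')

print(sample)
```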

Now we use groupby to check how many A and B type publications are in our data.
We also use the function unique() on doc_type to see which subkinds of
publication exist;
in the last cell we count the records for each (ptype, doc_type) pair, to check how
doc_type corresponds to the publication macro type.


In [None]:
data.groupby('ptype')['UID'].count()

In [None]:
data['doc_type'].unique()

In [None]:
data.groupby(['ptype','doc_type'])['UID'].count()

Most of the examples above just show the result of an operation without storing it,
either in the existing DataFrame or in a new one.

We might also create a new dataset with the count by ptype:

In [None]:

newdata = data.groupby('ptype')['UID'].count()
newdata


As you can see, the new object (a Series) has ptype as its index, so we should add the function

reset_index()

if we want ptype as a regular column and a default numeric index.

If you run newdata.index before and after, you will see the description of the index change.
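
The before/after difference can be seen on a hypothetical three-row dataset (values invented; the names counts and flat are only for this sketch):

```python
import pandas as pd

# Hypothetical records.
sample = pd.DataFrame({
    'UID': ['W1', 'W2', 'W3'],
    'ptype': ['A', 'B', 'A'],
})

counts = sample.groupby('ptype')['UID'].count()  # Series indexed by ptype
flat = counts.reset_index()                      # DataFrame with a default index

print(counts.index)  # the group labels
print(flat.index)    # a plain 0..n-1 index
print(flat)
```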

In [None]:
newdata = data.groupby('ptype')['UID'].count().reset_index()
newdata


Finally, we could also add a column with the count to the main dataset.

In [None]:
data['count'] = data.groupby('ptype')['UID'].transform('count')
data.head(5)

## Iterating dataframes

items() (called iteritems() before pandas 2.0): iterates over the (key, value) pairs;
it is a sort of horizontal scan by field name, yielding one column at a time

iterrows(): iterates over the rows as (index, Series) pairs;
scans each row by index

itertuples(): iterates over the rows as namedtuples;
returns an object holding all values and field names plus the index value


In [None]:
df1 = data.iloc[0:2, 0:5]
# items() replaces iteritems(), which was removed in pandas 2.0
for key, value in df1.items():
    print(key)
    print(value)
    print('\n')

In [None]:
for row_index, row in df1.iterrows():
    print(row_index)
    print(row)
    print('\n')

In [None]:
for row in df1.itertuples():
    print(row.UID)
    print(row)
    

## The apply() function

apply() takes a function and applies it to all values of a pandas Series.


Series.apply(func, convert_dtype=True, args=())

func: the function to apply to each value of the Series.

convert_dtype: try to find a better dtype for the results of the function (deprecated in recent pandas versions).

args=(): positional arguments passed to func after the Series value.

Return type: a pandas Series holding the results of the function (a DataFrame if func returns a Series).
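
To illustrate the args parameter, here is a sketch with a hypothetical helper strip_tag(text, tag) applied to a made-up Series of abstracts. The tag to remove is passed through args instead of being hard-coded in the function:

```python
import pandas as pd

def strip_tag(text, tag):
    """Remove every occurrence of tag; pass non-string values through unchanged."""
    if isinstance(text, str):
        return text.replace(tag, '')
    return text

# Hypothetical abstracts, one of them missing.
abstracts = pd.Series(['<p>First abstract', None, 'No tag here'])

# args supplies the extra positional argument tag on every call.
cleaned = abstracts.apply(strip_tag, args=('<p>',))

print(cleaned)
```

Checking the type explicitly is an alternative to catching AttributeError: missing values are returned as they are instead of being converted.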



In [None]:
def remove_p(text):
    
    try: 
        return text.replace('<p>', '')
    except AttributeError:
        return np.nan  # the np.NaN alias was removed in NumPy 2.0


data['abstract']=data['abstract'].apply(remove_p)
data.head(5)
