< [NTLK](https://tdm.universiteitleiden.nl/Python/NLTK.html) | [Table of contents](https://tdm.universiteitleiden.nl/Python) | [Data visualisation with matplotlib](https://tdm.universiteitleiden.nl/Python/Visualisation.html) >

# Data analyses with pandas


Research projects based on text and data mining typically start with the creation of structured data. During such an initial phase, linear texts written in natural language are converted into discrete data. Once the data are created, research projects can continue to analyse and to visualise these data. 

The results of the data creation stage are often stored in a data file structured according to the Comma-Separated Values (CSV) format. An example of a CSV file can be found below.











In [None]:
title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
Ivanhoe,210928,6245,12663,8360,29230
MobyDick,252594,9982,18578,14207,32773
PrideandPrejudice,143598,5852,7777,9171,23724
SonsandLovers,204126,16218,9630,10853,33534
ThroughtheLookingGlass,36680,2061,1639,2096,6104
TreasureIsland,82769,3734,4054,4361,12302
VanityFair,355446,13224,22002,14988,50865

The lines in the file above all contain values which are separated by commas. The first line specifies names for these values. This specific CSV file contains data about ten novels. For each novel, it gives six values: the title and the number of tokens, sentences, adjectives, adverbs and verbs.

A first activity in the data analysis phase is commonly to read the data that have been created for the research project and to make these available within a computer program for further processing. In Python, the pandas library offers many methods that are useful for working with CSV files. The name pandas loosely stands for “Python Data Analysis Library”. Older releases of the library supported Python 2.7, but current releases require Python 3. It was designed specifically to support research in the field of data science. Pandas contains methods that enable programmers to read data from a wide range of sources, including CSV files, TSV (tab-separated values) files, Excel files and SQL databases such as MySQL.

If pandas has been installed successfully on your computer, this library can be imported using the code below. As can be seen, the line below also assigns an alias, which is a short code for the library. This solution has the advantage that the library can be referred to using this brief code. Without this alias, the full name ‘pandas’ would have to be typed in each time a method from this library is needed.   


  

In [None]:
import pandas as pd

## Reading a csv file

If the data discussed earlier are saved under the name ‘data.csv’, in the same folder as your program, they can be made available to your Python program via the read_csv method.




In [None]:
df = pd.read_csv( 'data.csv' )

The data that are read from the CSV file are represented as a specific data structure which is known within the context of pandas as a data frame. A data frame can be compared to a regular table in a database, consisting of rows and columns. Using the example that was given, the above code will create a data frame consisting of 10 rows and 6 columns. The first line of the CSV file specifies the names of these columns and is therefore not counted as a row.
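The read_csv method is not limited to comma-separated files. The following is a minimal sketch of reading tab-separated data, assuming a small TSV sample built from the first three columns of two of the novels above. The sample is held in a string and wrapped in io.StringIO so that the example runs without a file on disk; the names tsv_text and tsv_df are chosen for illustration only.

```python
import io
import pandas as pd

# Illustrative tab-separated sample (an assumption for this sketch);
# in practice this would be a .tsv file on disk.
tsv_text = (
    "title\ttokens\tsentences\n"
    "Ivanhoe\t210928\t6245\n"
    "MobyDick\t252594\t9982\n"
)

# The `sep` parameter tells read_csv which character separates the values.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_df)
```

For a real file, the first argument would simply be the file name instead of the io.StringIO object.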

Once a data frame has been created, the data can be explored on a basic level using a number of methods. The head() method returns the first few rows of the data frame. The number of rows can be specified within the brackets; if no number is given, the first five rows are returned.



In [None]:
import pandas as pd

df = pd.read_csv( 'data.csv' )

print( df.shape )
print( df.head(2) )
print( df.columns )
  

As can be seen in the code above, the names of the columns can be printed using the property ‘columns’, and the ‘shape’ property returns the number of rows and columns. Note that ‘columns’ and ‘shape’ are properties and not methods. Because of this distinction, these words need to be used without parentheses.
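The difference between properties and methods can be sketched as follows. The example assumes only the first three novels of the data set, embedded as a string so that it runs without ‘data.csv’; the variable names are illustrative.

```python
import io
import pandas as pd

# First three rows of the example data, embedded for a self-contained sketch.
csv_text = """title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
"""
df = pd.read_csv(io.StringIO(csv_text))

# Properties are read without parentheses.
n_rows, n_columns = df.shape       # a tuple: (rows, columns)
column_names = list(df.columns)    # the column labels as a plain list

# Methods are called with parentheses, optionally with arguments.
first_rows = df.head(2)

print(n_rows, n_columns)
print(column_names)
print(first_rows)
```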

More advanced statistical information can be calculated using the following methods:


| Method | Description |
| --- | --- |
| mean() | The mean of the values in each of the columns |
| median() | The median of the values in each of the columns |
| max() | For each column, the highest value |
| min() | For each column, the lowest value |
| corr() | Correlations between the values in all the columns |


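These methods return a separate result for each of the columns in the data frame. For the CSV file that was given, df.mean() returns the mean number of tokens, sentences, adjectives, adverbs and verbs across the ten novels. Note that recent versions of pandas raise an error when asked for the mean of a text column such as ‘title’, so the calculation may need to be restricted to the numeric columns, for example with df.mean(numeric_only=True).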
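As a minimal sketch of these methods, the example below embeds the ten rows of the CSV file shown earlier as a string, so that it runs without ‘data.csv’. Dropping the ‘title’ column first is one possible way to keep only the numeric columns; the variable names are illustrative.

```python
import io
import pandas as pd

# The example data from the start of this section, embedded as a string.
csv_text = """title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
Ivanhoe,210928,6245,12663,8360,29230
MobyDick,252594,9982,18578,14207,32773
PrideandPrejudice,143598,5852,7777,9171,23724
SonsandLovers,204126,16218,9630,10853,33534
ThroughtheLookingGlass,36680,2061,1639,2096,6104
TreasureIsland,82769,3734,4054,4361,12302
VanityFair,355446,13224,22002,14988,50865
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the numeric columns; 'title' contains text.
counts = df.drop(columns=["title"])

means = counts.mean()          # one mean per column
medians = counts.median()      # one median per column
highest = counts.max()         # highest value per column
lowest = counts.min()          # lowest value per column
correlations = counts.corr()   # 5 x 5 table of pairwise correlations

print(means)
print(correlations)
```

Each of the first four results is a Series with one value per column, so a single figure can be retrieved with, for instance, means['tokens'].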



## Series

The values in a column of a data frame can be accessed separately using the name of the column. The column names can be found on the first line of the csv file. In order to obtain the data in a specific column, the name of the column must be appended to the name of the data frame using square brackets, as follows.



In [None]:
df = pd.read_csv( 'data.csv' )

print( df['tokens'] )

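The variable df[‘tokens’] in the example above is an example of a data structure which pandas refers to as a Series. A Series resembles a regular Python list in that it holds an ordered sequence of values. Unlike a list, each value also carries an index label, and calculations can be applied to all the values at once.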
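The similarities and differences between a Series and a list can be sketched as follows, assuming only the first three novels of the data set, embedded as a string. The derived value tokens_per_sentence is an illustration and not part of the original data.

```python
import io
import pandas as pd

# First three rows of the example data, embedded for a self-contained sketch.
csv_text = """title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
"""
df = pd.read_csv(io.StringIO(csv_text))

tokens = df["tokens"]           # selecting one column yields a Series

# Like a list, a Series has a length and can be converted to a list.
tokens_list = tokens.tolist()
print(len(tokens), tokens_list)

# Unlike a list, a Series offers statistical methods of its own ...
print(tokens.max())

# ... and supports calculations on all values at once.
tokens_per_sentence = tokens / df["sentences"]
print(tokens_per_sentence.round(2))
```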

## Navigating through a data frame

To visit all the rows in a data frame, you can make use of iterrows(). For each row, this method returns the index of the row together with a Series that holds the values of that row. In the code below, this Series is given the name ‘row’. When the variable ‘row’ is used in combination with a column name, it gives the value found in that column for the current row. Listing 7.1 is an illustration.



In [None]:
df = pd.read_csv( 'data.csv' )

for index, row in df.iterrows():
    print( row["title"] )
    print( row["tokens"] )
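The values obtained while iterating can also be combined in a calculation. The sketch below assumes the first three novels of the data set, embedded as a string, and names the loop variable ‘row’ because it holds the values of one row. The list sentence_lengths and the average it stores are illustrations, not part of the original data.

```python
import io
import pandas as pd

# First three rows of the example data, embedded for a self-contained sketch.
csv_text = """title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
"""
df = pd.read_csv(io.StringIO(csv_text))

sentence_lengths = []
for index, row in df.iterrows():
    # 'index' is the row label; 'row' is a Series holding that row's values.
    average = row["tokens"] / row["sentences"]
    sentence_lengths.append((index, row["title"], average))
    print(index, row["title"], round(average, 1))
```

For larger data sets, the same calculation can usually be written without a loop, as df['tokens'] / df['sentences'], which tends to be faster than iterating row by row.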

