# Python versus pandas

In previous courses you learned to process files line by line and to extract information from each line, typically with a for loop. In the following example a file named `corpus` is processed. The file contains *4 million processing steps of recipes*. In the Python way, the file is read line by line, and each line containing the word 'tomato' is stored in a list together with its index. In the pandas way, the file is not processed line by line; instead, the whole column is processed with vectorized operations. This is an important concept in pandas.

The first line of the file is the header; the next three lines are the first three records of the file.

    ,step
    0,preheat oven f
    1,butter oil inch baking dish
    2,cook penne minute package direction
    


In [1]:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## The Python way

A Python for loop processes one instruction on one data element at a time and produces one result per step. The advantage is flexibility: you can implement any operation on your data. The drawback is that you need more lines of code and that the loop is not optimized for speed or memory, because every iteration is executed separately by the Python interpreter.

In [2]:
x = [] #create an empty list

#process the file
with open('data/corpus') as f:
    for line in f:
        #split on the first comma only, in case a step itself contains a comma
        index, step = line.split(',', 1)
        if 'tomato' in step:
            x.append(f'{index} \t {step}') #tab separated

#print the first five lines of the tomato processing step list
for line in x[:5]:
    print(line, end='')

32 	 add tomato food processor pinch salt puree smooth
33 	 combine onion bell pepper cucumber tomato puree large bowl
59 	 toss greens pound cup carrot grate tomato halve green onion slice large sized bowl
79 	 add tomato sauce hamburger
103 	 add saute onion dice tomato hot red pepper powder stir minute add cup water leave boil


## Vectorization

According to the definition given in the official NumPy documentation, vectorization means being “able to delegate the task of performing mathematical operations on the array’s contents to optimized, compiled C code.” Instead of looping through rows, columns or elements, this allows us to apply one set of instructions to multiple elements at once.
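
As a minimal sketch of this idea, the example below doubles the values of a small NumPy array twice: once with an explicit loop and once with a single vectorized expression. The array `values` is made-up sample data and is not taken from the corpus file.

```python
import numpy as np

# Hypothetical sample data, not from the corpus file.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Element-wise loop: one multiplication per iteration in the Python interpreter.
doubled_loop = np.empty_like(values)
for i in range(len(values)):
    doubled_loop[i] = values[i] * 2

# Vectorized: a single expression; the loop runs inside NumPy's compiled code.
doubled_vec = values * 2

print(doubled_vec)                                # [2. 4. 6. 8.]
print(np.array_equal(doubled_loop, doubled_vec))  # True
```

Both versions give the same result; the vectorized one simply leaves the looping to NumPy.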


## The pandas way
As stated above, a vectorized implementation applies one instruction to any number of data elements at once and produces multiple results. The expression `f[f['step'].str.contains('tomato')]` is an example of such an instruction. In a single operation it selects all rows of the DataFrame `f` whose step contains the word tomato. In data processing, vectorized instructions are favored over element-wise instructions, since vectorized implementations run in optimized library code and can use the CPU more efficiently.

In [3]:
f = pd.read_csv('data/corpus', index_col=0) #read the file, using the first column as the index
x = f[f['step'].str.contains('tomato')]
print(x.head())

                                                  step
32   add tomato food processor pinch salt puree smooth
33   combine onion bell pepper cucumber tomato pure...
59   toss greens pound cup carrot grate tomato halv...
79                          add tomato sauce hamburger
103  add saute onion dice tomato hot red pepper pow...
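
To try both approaches without the corpus file, the sketch below builds a tiny in-memory table in the same layout (a header line followed by `index,step` records), using a few of the records shown in this section. It then checks that the loop and the vectorized mask select the same rows. The names `sample`, `df`, `loop_hits` and `vector_hits` are illustrative; `regex=False` is an assumption made here to get a plain substring match, like the `in` operator in the loop.

```python
import io
import pandas as pd

# A few records in the same layout as the corpus file (header, then index,step).
# The rows are copied from the samples in this section; the real file is not read.
sample = io.StringIO(
    ",step\n"
    "0,preheat oven f\n"
    "1,butter oil inch baking dish\n"
    "32,add tomato food processor pinch salt puree smooth\n"
    "79,add tomato sauce hamburger\n"
)

df = pd.read_csv(sample, index_col=0)

# Loop version: test each row one at a time.
loop_hits = [idx for idx, step in df['step'].items() if 'tomato' in step]

# Vectorized version: build a boolean mask for the whole column in one call.
mask = df['step'].str.contains('tomato', regex=False)
vector_hits = df[mask].index.tolist()

print(mask.tolist())             # [False, False, True, True]
print(vector_hits)               # [32, 79]
print(loop_hits == vector_hits)  # True
```

The boolean mask is what makes the selection work: `df[mask]` keeps exactly the rows where the mask is `True`.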


## Conclusion

In data processing we prefer to stick with vectorized operations on pandas DataFrames or NumPy arrays. We should try to avoid for loops and row-by-row iteration as much as possible.
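
The difference can be measured with a rough sketch like the one below, which computes the same total once with row-by-row iteration and once with a column expression. The data and the column names `quantity` and `price` are hypothetical, and the timings depend on the machine, so no numbers are claimed here.

```python
import timeit
import numpy as np
import pandas as pd

# Hypothetical numeric data; the column names are made up for illustration.
n = 10_000
df = pd.DataFrame({'quantity': np.arange(n), 'price': np.full(n, 2.5)})

def total_with_loop(frame):
    # Row-by-row iteration: every row is handled separately in Python.
    total = 0.0
    for _, row in frame.iterrows():
        total += row['quantity'] * row['price']
    return total

def total_vectorized(frame):
    # One expression over whole columns.
    return (frame['quantity'] * frame['price']).sum()

loop_total = total_with_loop(df)
vector_total = total_vectorized(df)
print(loop_total == vector_total)  # True

# Time a single run of each version.
print('loop      :', timeit.timeit(lambda: total_with_loop(df), number=1))
print('vectorized:', timeit.timeit(lambda: total_vectorized(df), number=1))
```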