# Pandas DataFrame operations

This is a collection of examples on what to do with pandas data frames (df).

Things to remember when working with data frames:
* Difference between row index and row number. The row index works like a database primary key: it does not change when you delete rows or perform other operations on the df. You can access rows either by their index (`.loc`) or by their row number (`.iloc`, i.e. like a regular array).
* Difference between slicing the df and creating a copy of a subset. E.g. when selecting fewer columns with `new_df = original_df[['Col1', 'Col3']]`, pandas may treat new_df as derived from the original df rather than as an independent copy. Adding new columns to it can then trigger a `SettingWithCopyWarning` (depending on the pandas version); use `.copy()` when you need an independent df.
* __lambda__ is your friend. You can do most of the data manipulations with .apply() and lambda.
* There are a lot of out-of-the-box df summary functions.
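
The difference between index labels and row numbers from the list above can be sketched as follows. The `cars` frame and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy frame with a custom (non-consecutive) index
cars = pd.DataFrame(
    {"Company": ["VW", "Toyota", "Tesla"], "Year": [1972, 2005, 2016]},
    index=[2, 4, 6],
)

# Dropping a row leaves the remaining index labels untouched
remaining = cars.drop(index=4)

by_label = remaining.loc[6, "Company"]        # row whose index label is 6
by_position = remaining["Company"].iloc[1]    # second row, whatever its label

print(list(remaining.index))   # [2, 6]
print(by_label, by_position)   # Tesla Tesla
```

After the drop, label 6 and position 1 point at the same row, which is why the two lookups agree.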



In [2]:
import pandas as pd

## Initialization / population

In [479]:
# Empty df
pd.DataFrame()

In [480]:
# Empty dataframe with heading
pd.DataFrame(columns=['A','B','C', 'D'])


Unnamed: 0,A,B,C,D


In [481]:
pd.DataFrame({"A": range(3)})

Unnamed: 0,A
0,0
1,1
2,2


In [482]:
# Test df with values supplied in the code and custom index
values1 = ['VW', 'Toyota', 'Tesla', 'VW']
values2 = ['Beetle', 'Corolla', 'Model S', 'Rabbit']
values3 = [1972, 2005, 2016, 2009]
df = pd.DataFrame({'Company':values1, 'Model':values2, 'Year':values3}, index=[2,4,6,8])
df

Unnamed: 0,Company,Model,Year
2,VW,Beetle,1972
4,Toyota,Corolla,2005
6,Tesla,Model S,2016
8,VW,Rabbit,2009


#### Re-set indexes

In [483]:
# inplace=True resets the index on the df itself. Without it, assign the result
#   to a variable to keep the changes.
# Without drop=True, the old index is also kept as a new column.
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,Company,Model,Year
0,VW,Beetle,1972
1,Toyota,Corolla,2005
2,Tesla,Model S,2016
3,VW,Rabbit,2009
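
To make the effect of `drop=True` visible, here is a small sketch on a made-up two-row frame (the name `cars` is only for illustration):

```python
import pandas as pd

cars = pd.DataFrame({"Model": ["Beetle", "Corolla"]}, index=[2, 4])

kept = cars.reset_index()               # old index becomes an 'index' column
dropped = cars.reset_index(drop=True)   # old index is discarded

print(list(kept.columns))     # ['index', 'Model']
print(list(dropped.columns))  # ['Model']
print(list(cars.index))       # [2, 4]  (original unchanged without inplace=True)
```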


#### Read Excel, way 1
This is a little faster than way 2.

In [484]:
excel_obj = pd.ExcelFile("data/test.xlsx")
df = excel_obj.parse('Sheet1')  # sheet name
df.head(3)

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,Abortiporus biennis,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,Absidia anomala,Hesseltine & J.J. Ellis,MycoBank


#### Read Excel, way 2:
A little slower than way 1.

In [485]:
df = pd.read_excel('data/test.xlsx', sheet_name='Sheet1')
df.head(3)

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,Abortiporus biennis,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,Absidia anomala,Hesseltine & J.J. Ellis,MycoBank


### ===

* The write-to-Excel example is at the end of the notebook.
* The example of loading data from a json query is also at the end of the notebook.

## Accessing elements of data frame

Simple operations. For more complex ones, like slicing, see below.

In [486]:
# Test df
values1 = ['VW', 'Toyota', 'Tesla', 'VW']
values2 = ['Beetle', 'Corolla', 'Model S', 'Rabbit']
values3 = [1972, 2005, 2016, 2009]
df = pd.DataFrame({'Company':values1, 'Model':values2, 'Year':values3}, index=[2,4,6,8])
df

Unnamed: 0,Company,Model,Year
2,VW,Beetle,1972
4,Toyota,Corolla,2005
6,Tesla,Model S,2016
8,VW,Rabbit,2009


In [21]:
#### Check if df is empty
df.empty

False

In [487]:
# Access one cell by index and column name
df.loc[2,"Model"]

'Beetle'

In [488]:
# Access one cell by row number and column name
df.iloc[0]["Model"]

'Beetle'

#### Get rows by row number

In [489]:
# Get first rows (same as df.head(#))
df[:3]

Unnamed: 0,Company,Model,Year
2,VW,Beetle,1972
4,Toyota,Corolla,2005
6,Tesla,Model S,2016


In [490]:
# last rows
df[-4:] 

Unnamed: 0,Company,Model,Year
2,VW,Beetle,1972
4,Toyota,Corolla,2005
6,Tesla,Model S,2016
8,VW,Rabbit,2009


In [491]:
# View rows by position range (start included, end excluded)
df[3:5]

Unnamed: 0,Company,Model,Year
8,VW,Rabbit,2009


#### Get rows by df index

In [492]:
df.loc[2]

Company        VW
Model      Beetle
Year         1972
Name: 2, dtype: object

In [493]:
df.loc[3:10]  # Note that even though neither of the specified index labels exists in the df,
              # this still returns the rows whose labels fall in between them.

Unnamed: 0,Company,Model,Year
4,Toyota,Corolla,2005
6,Tesla,Model S,2016
8,VW,Rabbit,2009
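
One detail worth keeping in mind when slicing rows: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position. A minimal sketch on a made-up frame:

```python
import pandas as pd

cars = pd.DataFrame(
    {"Model": ["Beetle", "Corolla", "Model S", "Rabbit"]},
    index=[2, 4, 6, 8],
)

by_label = cars.loc[4:8]       # labels 4, 6, 8: end label is included
by_position = cars.iloc[1:3]   # positions 1 and 2: end position is excluded

print(list(by_label.index))     # [4, 6, 8]
print(list(by_position.index))  # [4, 6]
```

Slicing with labels that are not present, as in `df.loc[3:10]` above, relies on the index being sorted; with an unsorted index pandas raises a `KeyError` instead.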


#### Iterating

In [494]:
# Iterate over a column
for md in df['Model']:
    print(md)

Beetle
Corolla
Model S
Rabbit


In [495]:
### Iterate over rows in a data frame
for index, row in df.iterrows():
    print('Index: {}\nRow: {}\n'.format(index,row))

Index: 2
Row: Company        VW
Model      Beetle
Year         1972
Name: 2, dtype: object

Index: 4
Row: Company     Toyota
Model      Corolla
Year          2005
Name: 4, dtype: object

Index: 6
Row: Company      Tesla
Model      Model S
Year          2016
Name: 6, dtype: object

Index: 8
Row: Company        VW
Model      Rabbit
Year         2009
Name: 8, dtype: object



## Add data
Note that adding one row at a time to a df is computationally expensive, so it is better to collect the data in another data structure and add it all in one go.
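
One way to follow that advice is to collect the rows in a plain list of dicts and build the df once at the end. This is a sketch; the loop input is made up.

```python
import pandas as pd

rows = []
for company, model, year in [("Jeep", "Cherokee", 2007), ("Tesla", "S", 2016)]:
    # Appending to a Python list is cheap compared to growing a df row by row
    rows.append({"Company": company, "Model": model, "Year": year})

collected = pd.DataFrame(rows)   # build the df in one go
print(collected.shape)           # (2, 3)
```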

#### Merge two dfs

In [18]:
df1 = pd.DataFrame({'Company':['Jeep', 'Chevrolet'], 'Model':['Cherokee', 'Impala'], 'Year':[2007, 2004]})
df2 = pd.DataFrame({'Company':['Tesla', 'VW'], 'Model':['S', 'Rabbit'], 'Year':[2016, 2009]})

In [19]:
# Append one df to the end of another
# Note that the indexes are kept from the original dfs. Use ignore_index=True to reset them.
# (DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.)
df = pd.concat([df1, df2])   ### Returns a new df, so assign the result to keep it
df

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2004
0,Tesla,S,2016
1,VW,Rabbit,2009


In [20]:
# Add lists of values to df
# Idea: create a new df from the lists of values and concatenate it with the existing df
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
compList = ["VW", "Toyota"]
modelList = ["Tiguan", "Sienna"]
yearList = [2017, 2018]
df = pd.concat([df, pd.DataFrame({'Company':compList, 'Model':modelList, 'Year':yearList})], ignore_index=True)
df

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2004
2,Tesla,S,2016
3,VW,Rabbit,2009
4,VW,Tiguan,2017
5,Toyota,Sienna,2018


#### Modify a value in the df

In [498]:
# Based on index.
# Note that because we have duplicate indexes in the df, 2 rows have their 'Year' value changed to 2010.
df.at[1,'Year'] = 2010
df

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2010
0,Tesla,S,2016
1,VW,Rabbit,2010


#### Add row

In [499]:
# Append a new row to the end of df
# The row is wrapped in a one-row df; ignore_index=True renumbers the index.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
df = pd.concat([df, pd.DataFrame([{'Company':'Nissan', 'Model':'Leaf', 'Year':2017}])], ignore_index=True)
df

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2010
2,Tesla,S,2016
3,VW,Rabbit,2010
4,Nissan,Leaf,2017


In [500]:
# Add row by index.
# Note that if that index already exists, this will replace the old row with the new one.
df.loc[3] = ['Subaru','Forrester', 2010]  # Adds this row INSTEAD OF 4th row in data frame
df.head()

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2010
2,Tesla,S,2016
3,Subaru,Forrester,2010
4,Nissan,Leaf,2017


#### Add column

In [501]:
df['Year_Squared'] = df['Year']**2
df

Unnamed: 0,Company,Model,Year,Year_Squared
0,Jeep,Cherokee,2007,4028049
1,Chevrolet,Impala,2010,4040100
2,Tesla,S,2016,4064256
3,Subaru,Forrester,2010,4040100
4,Nissan,Leaf,2017,4068289


In [502]:
# Alternative way, specifying the location of the column
df.insert(loc=1, column="New Col", value="la")
df

Unnamed: 0,Company,New Col,Model,Year,Year_Squared
0,Jeep,la,Cherokee,2007,4028049
1,Chevrolet,la,Impala,2010,4040100
2,Tesla,la,S,2016,4064256
3,Subaru,la,Forrester,2010,4040100
4,Nissan,la,Leaf,2017,4068289


#### Add column and add values to this column with an index

Use case: we want to add a "rating" column with values for selected index labels only (here 4 and 8). Rows whose index label is not listed get NaN, and labels that do not exist in the df (here 8) are ignored by the join.

In [428]:
from pandas import DataFrame  # explicit import instead of `from pandas import *`
idx = pd.Index([4, 8])        # pd.Int64Index was removed in pandas 2.0
tmp_df = DataFrame(index=idx, data={'rating': [5, 3]})
rating_df = df.join(tmp_df)
rating_df

Unnamed: 0,Company,Model,Year,Year_Squared,rating
0,Jeep,Cherokee,2007,4028049,
1,Chevrolet,Impala,2010,4040100,
2,Tesla,S,2016,4064256,
3,Subaru,Forrester,2010,4040100,
4,Nissan,Leaf,2017,4068289,5.0


## Delete data

#### Delete column

In [429]:
del df['Year_Squared']
df

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2010
2,Tesla,S,2016
3,Subaru,Forrester,2010
4,Nissan,Leaf,2017


##### Delete several columns by name

In [430]:
del_df = pd.DataFrame({"col1":range(3), "col2":range(3), "col3":range(3)})
del_df

Unnamed: 0,col1,col2,col3
0,0,0,0
1,1,1,1
2,2,2,2


In [431]:
del_df = del_df.drop(["col1", "col2"], axis=1)  # need to assign it back to the df,
                                                # otherwise changes are lost
del_df

Unnamed: 0,col3
0,0
1,1
2,2


#### Delete row

In [432]:
df.drop(df.index[[2]])  # drops the row at position 2 (its index label here is also 2).
                        # Note: once it is dropped, index label 2 is gone. The remaining labels are not shifted.

Unnamed: 0,Company,Model,Year
0,Jeep,Cherokee,2007
1,Chevrolet,Impala,2010
3,Subaru,Forrester,2010
4,Nissan,Leaf,2017


### Drop rows with empty values

##### Add another col with empty data for an example

In [433]:
tmp_df = pd.DataFrame(index=pd.Index([0, 4]), data={'some_col': [0, 0]})  # pd.Int64Index was removed in pandas 2.0
tmp_df = rating_df.join(tmp_df)
tmp_df.at[1,'Company'] = float('nan')
tmp_df

Unnamed: 0,Company,Model,Year,Year_Squared,rating,some_col
0,Jeep,Cherokee,2007,4028049,,0.0
1,,Impala,2010,4040100,,
2,Tesla,S,2016,4064256,,
3,Subaru,Forrester,2010,4040100,,
4,Nissan,Leaf,2017,4068289,5.0,0.0


##### Drop empty values (note that everything but Nissan is dropped, because dropna works on all columns)

In [434]:
tmp_df.dropna()

Unnamed: 0,Company,Model,Year,Year_Squared,rating,some_col
4,Nissan,Leaf,2017,4068289,5.0,0.0


#### To "drop" only based on one column, select the non-empty rows instead

In [435]:
tmp_df[pd.notnull(tmp_df['rating'])]

Unnamed: 0,Company,Model,Year,Year_Squared,rating,some_col
4,Nissan,Leaf,2017,4068289,5.0,0.0
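
An alternative to the boolean filter above is `dropna` with its `subset` parameter, which restricts the check to the listed columns. A minimal sketch on made-up data (the `ratings` frame is only for illustration):

```python
import pandas as pd

nan = float("nan")
ratings = pd.DataFrame({
    "Company": ["Jeep", None, "Nissan"],
    "rating": [nan, nan, 5.0],
    "some_col": [0.0, nan, 0.0],
})

rated = ratings.dropna(subset=["rating"])    # only 'rating' is checked
named = ratings.dropna(subset=["Company"])   # only 'Company' is checked

print(list(rated.index))  # [2]
print(list(named.index))  # [0, 2]
```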


#### Alternative way to drop rows with empty values
Background: sometimes, when NaN values are read from Excel, .notnull or .dropna does not work on them because they arrive as a different data type. In one case I encountered, the filter below is what removed those "nan values". I could not reproduce the issue here (.notnull works in this example), but try this if it does not work for you (and update the example :)

In [436]:
tmp_df

Unnamed: 0,Company,Model,Year,Year_Squared,rating,some_col
0,Jeep,Cherokee,2007,4028049,,0.0
1,,Impala,2010,4040100,,
2,Tesla,S,2016,4064256,,
3,Subaru,Forrester,2010,4040100,,
4,Nissan,Leaf,2017,4068289,5.0,0.0


In [437]:
# Don't forget to assign it to a var to keep the changes
tmp_df[tmp_df["Company"].apply(lambda col_value: isinstance(col_value, str))]

Unnamed: 0,Company,Model,Year,Year_Squared,rating,some_col
0,Jeep,Cherokee,2007,4028049,,0.0
2,Tesla,S,2016,4064256,,
3,Subaru,Forrester,2010,4040100,,
4,Nissan,Leaf,2017,4068289,5.0,0.0
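
One possible cause of the problem described above (an assumption, since the original issue could not be reproduced) is that the missing values arrive as text such as `"nan"` or as blank strings. In that case they can be turned into real NaN values first, after which `.dropna` works as usual. The placeholder list below is hypothetical and should be adapted to the data at hand.

```python
import pandas as pd

# Hypothetical column where missing values came in as text
raw = pd.DataFrame({"Company": ["Jeep", "nan", " ", "Tesla"]})

stripped = raw["Company"].str.strip()
placeholders = ["nan", ""]   # assumed spellings of "missing"

# mask() replaces the matching cells with real NaN
cleaned = raw.assign(Company=stripped.mask(stripped.isin(placeholders)))
result = cleaned.dropna(subset=["Company"])

print(list(result["Company"]))  # ['Jeep', 'Tesla']
```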


## Exploring / viewing df structure. Summarizing data.

In [438]:
# Load test df
df = pd.read_excel('data/test.xlsx', sheet_name='Sheet1')

In [439]:
# Set number of rows to view (viewable rows)
pd.set_option('display.max_rows', 7)
df

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,Abortiporus biennis,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,Absidia anomala,Hesseltine & J.J. Ellis,MycoBank
...,...,...,...,...
202,Helminthosporium californicum,Bipolaris sorokiniana,Mackie & G.E. Paxton,ICTF
203,Ophiobolus sativus,Bipolaris sorokiniana,S. Ito & Kurib.,ICTF
204,Cochliobolus sativus,Bipolaris sorokiniana,(S. Ito & Kurib.) Drechsler ex Dastur,ICTF


In [440]:
# See column names
df.columns

Index(['taxon_name', 'current_taxon_name', 'authors', 'source'], dtype='object')

In [441]:
# See column data types
df.dtypes

taxon_name            object
current_taxon_name    object
authors               object
source                object
dtype: object

In [442]:
# Dimensions of the df
df.shape

(205, 4)

In [443]:
# Summarize structure and data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 4 columns):
taxon_name            205 non-null object
current_taxon_name    205 non-null object
authors               203 non-null object
source                205 non-null object
dtypes: object(4)
memory usage: 6.5+ KB


In [444]:
# Summarize data in the df
df.describe()

Unnamed: 0,taxon_name,current_taxon_name,authors,source
count,205,205,203,205
unique,196,76,167,3
top,Alternaria brassicae,Plenodomus lingam,Frisch & G. Thor,ICTF
freq,4,16,6,157


In [445]:
# Summarize data with transpose view
df.describe().transpose()

Unnamed: 0,count,unique,top,freq
taxon_name,205,196,Alternaria brassicae,4
current_taxon_name,205,76,Plenodomus lingam,16
authors,203,167,Frisch & G. Thor,6
source,205,3,ICTF,157


In [446]:
# View top rows
df.head()       # default is 5 rows
df.head(2)     # set a number of rows to view

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,Abortiporus biennis,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank


In [447]:
# See unique values of a column
unique_values = df['taxon_name'].unique()
unique_values.sort()
unique_values[:3]  # view first 3 of the array

array(['Abortiporus biennis', 'Absidia anomala', 'Absidia blakesleeana'],
      dtype=object)

## Data and structure manipulations

### Slicing
I.e. creating a subset of a data frame based on different requirements. Note that the result is a "slice" derived from the original df, and pandas does not guarantee whether it is a view or a separate copy, even if you assign it to a new variable. If you then add columns or values to the slice, pandas may emit a `SettingWithCopyWarning` (depending on the version); call `.copy()` on the slice if you want an independent df.
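
A minimal sketch of taking an explicit copy, so that later changes to the subset are unambiguous. The frame and column names are made up.

```python
import pandas as pd

original = pd.DataFrame({"Col1": [1, 2], "Col2": [3, 4], "Col3": [5, 6]})

# .copy() makes the subset an independent df
subset = original[["Col1", "Col3"]].copy()
subset["Total"] = subset["Col1"] + subset["Col3"]

print(list(subset.columns))    # ['Col1', 'Col3', 'Total']
print(list(original.columns))  # ['Col1', 'Col2', 'Col3']
```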

In [448]:
# Using the same test df as in the previous section
df.head(3)

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,Abortiporus biennis,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,Absidia anomala,Hesseltine & J.J. Ellis,MycoBank


#### Subset based on cell value
Use case: want to select all the rows that have 'ICTF' as a value in the source column

In [449]:
df_sub = df.loc[df['source']=='ICTF']
#df_sub.head()
len(df_sub)

157

In [450]:
# Subset based on the set of cell values
values = ['ICTF', 'MycoBank']
df_sub = df.loc[df['source'].isin(values)]
len(df_sub)
#df_sub.head()

184

#### Subset of selected columns

In [451]:
species_disease_df = df[['taxon_name','source']]
species_disease_df.head()

Unnamed: 0,taxon_name,source
0,Abortiporus biennis,MycoBank
1,Polyporus biennis,MycoBank
2,Absidia anomala,MycoBank
3,Apophysomyces atrospora,MycoBank
4,Absidia blakesleeana,MycoBank


#### All values from one column

In [452]:
species_sr = df['taxon_name']
species_sr.head(3)   # .head() also works in series

0    Abortiporus biennis
1      Polyporus biennis
2        Absidia anomala
Name: taxon_name, dtype: object

#### Rename columns

In [453]:
df1 = pd.DataFrame({'old_1':[1,2], 'old_2':[3,4]})
df1

Unnamed: 0,old_1,old_2
0,1,3
1,2,4


In [454]:
df1.columns = ['new_1', 'new_2']
df1

Unnamed: 0,new_1,new_2
0,1,3
1,2,4
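
Assigning to `.columns` replaces every name at once. When only some columns need new names, `rename` with a mapping is an alternative; a short sketch with made-up column names:

```python
import pandas as pd

df_old = pd.DataFrame({"old_1": [1, 2], "old_2": [3, 4]})

# Only the columns listed in the mapping are renamed; a new df is returned
df_new = df_old.rename(columns={"old_1": "new_1"})

print(list(df_new.columns))  # ['new_1', 'old_2']
print(list(df_old.columns))  # ['old_1', 'old_2']
```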


#### Reorder columns

In [473]:
df.head(3)

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,-,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,-,Hesseltine & J.J. Ellis,MycoBank


In [474]:
df = df[['current_taxon_name', 'taxon_name', 'source', 'authors']]
df.head(3)

Unnamed: 0,current_taxon_name,taxon_name,source,authors
0,-,Abortiporus biennis,MycoBank,(Bull.) Singer
1,Abortiporus biennis,Polyporus biennis,MycoBank,(Bulliard) Fries
2,-,Absidia anomala,MycoBank,Hesseltine & J.J. Ellis


In [476]:
# or with sorting:
df = df.sort_index(axis=1)
df.head()

Unnamed: 0,authors,current_taxon_name,source,taxon_name
0,(Bull.) Singer,-,MycoBank,Abortiporus biennis
1,(Bulliard) Fries,Abortiporus biennis,MycoBank,Polyporus biennis
2,Hesseltine & J.J. Ellis,-,MycoBank,Absidia anomala
3,H. Naganishi & Hirahara,,IndexFungorum,Absidia anomala
4,Lendner,Lichtheimia hyalospora,MycoBank,Absidia blakesleeana


### Find and drop rows where two columns have the same value
Scenario: want to drop the rows where two columns have the same value.
For this I use .apply(). By default, apply sends each column to the function. To send each row instead, add axis=1 to the apply parameters.

In [455]:
# Want to remove second row
values1 = ['Something', 'Something1', 'DUDU', 'VW']
values2 = ['Other', 'Something1', 'Lala', 'Rabbit']
dupl_df = pd.DataFrame({'col1':values1, 'col2':values2})
dupl_df

Unnamed: 0,col1,col2
0,Something,Other
1,Something1,Something1
2,DUDU,Lala
3,VW,Rabbit


In [456]:
# Don't forget to assign this to a data frame, if you want to keep this 
dupl_df[dupl_df.apply(lambda row: row["col1"]!=row["col2"], axis=1)]

Unnamed: 0,col1,col2
0,Something,Other
2,DUDU,Lala
3,VW,Rabbit
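
For a simple comparison like this one, the two columns can also be compared directly, without `.apply()`. This is a sketch of that alternative on the same values, not a replacement for the row-wise approach when the logic is more involved.

```python
import pandas as pd

pairs = pd.DataFrame({
    "col1": ["Something", "Something1", "DUDU", "VW"],
    "col2": ["Other", "Something1", "Lala", "Rabbit"],
})

# Element-wise comparison yields a boolean Series used as a row filter
different = pairs[pairs["col1"] != pairs["col2"]]

print(list(different.index))  # [0, 2, 3]
```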


In [457]:
dupl_df

Unnamed: 0,col1,col2
0,Something,Other
1,Something1,Something1
2,DUDU,Lala
3,VW,Rabbit


### Prepend a string to every value in a column

In [458]:
df = pd.DataFrame({'col1':['a', 'b', 'c'], 'col': range(3)})
df

Unnamed: 0,col1,col
0,a,0
1,b,1
2,c,2


In [459]:
df['col1'] = df['col1'].apply(lambda cell_value: "Row "+cell_value)
df

Unnamed: 0,col1,col
0,Row a,0
1,Row b,1
2,Row c,2
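
When the column holds only strings, the same result can be had with plain string concatenation on the whole column. A minimal sketch (the `letters` frame is made up):

```python
import pandas as pd

letters = pd.DataFrame({"col1": ["a", "b", "c"]})

# Concatenation is applied element-wise to the whole column
letters["col1"] = "Row " + letters["col1"]

print(letters["col1"].tolist())  # ['Row a', 'Row b', 'Row c']
```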


### Find and explore duplicates

In [460]:
df = pd.read_excel("data/duplicates.xlsx")
df = df.head(7)
df

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,-,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,-,Hesseltine & J.J. Ellis,MycoBank
3,Absidia anomala,,H. Naganishi & Hirahara,IndexFungorum
4,Absidia blakesleeana,Lichtheimia hyalospora,Lendner,MycoBank
5,Absidia californica,Absidia californica,J.J. Ellis & Hesseltine,MycoBank
6,Abortiporus biennis,Absidia coerulea,Bainier,IndexFungorum


In [461]:
df_dupl = df.duplicated('taxon_name', keep=False)
df_dupl.head()

0     True
1    False
2     True
3     True
4    False
dtype: bool

In [462]:
df.insert(loc=4, column='taxon_name duplicated', value=df_dupl)
df.head()

Unnamed: 0,taxon_name,current_taxon_name,authors,source,taxon_name duplicated
0,Abortiporus biennis,-,(Bull.) Singer,MycoBank,True
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank,False
2,Absidia anomala,-,Hesseltine & J.J. Ellis,MycoBank,True
3,Absidia anomala,,H. Naganishi & Hirahara,IndexFungorum,True
4,Absidia blakesleeana,Lichtheimia hyalospora,Lendner,MycoBank,False


In [463]:
unique_names = df['taxon_name'].unique()
unique_names.sort()
unique_names

array(['Abortiporus biennis', 'Absidia anomala', 'Absidia blakesleeana',
       'Absidia californica', 'Polyporus biennis'], dtype=object)

In [464]:
for name in unique_names:
    group = df.loc[df['taxon_name'] == name]
    new_group = group.drop_duplicates('current_taxon_name')
    print(new_group)
    print('--------------------')

            taxon_name current_taxon_name         authors         source  \
0  Abortiporus biennis                  -  (Bull.) Singer       MycoBank   
6  Abortiporus biennis   Absidia coerulea         Bainier  IndexFungorum   

   taxon_name duplicated  
0                   True  
6                   True  
--------------------
        taxon_name current_taxon_name                  authors         source  \
2  Absidia anomala                  -  Hesseltine & J.J. Ellis       MycoBank   
3  Absidia anomala                NaN  H. Naganishi & Hirahara  IndexFungorum   

   taxon_name duplicated  
2                   True  
3                   True  
--------------------
             taxon_name      current_taxon_name  authors    source  \
4  Absidia blakesleeana  Lichtheimia hyalospora  Lendner  MycoBank   

   taxon_name duplicated  
4                  False  
--------------------
            taxon_name   current_taxon_name                  authors  \
5  Absidia californica  Absidia cal

In [465]:
# Convert column format
#pd.to_numeric(agr_land_area['Value'])
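
The commented-out line refers to a df that is not part of this notebook, so here is a self-contained sketch of `pd.to_numeric` on made-up values. `errors="coerce"` is one option for handling cells that cannot be parsed: they become NaN instead of raising an error.

```python
import pandas as pd

values = pd.Series(["10", "2.5", "n/a"])

# Unparseable entries become NaN because of errors="coerce"
numbers = pd.to_numeric(values, errors="coerce")

print(numbers.tolist())  # [10.0, 2.5, nan]
print(numbers.dtype)     # float64
```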

### Remove duplicates

In [466]:
df = pd.read_excel("data/duplicates.xlsx")
df

Unnamed: 0,taxon_name,current_taxon_name,authors,source
0,Abortiporus biennis,-,(Bull.) Singer,MycoBank
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
2,Absidia anomala,-,Hesseltine & J.J. Ellis,MycoBank
...,...,...,...,...
15,Absidia griseola,Absidia griseola,H. Naganishi & Hirahara,MycoBank
16,Absidia hesseltinei,Lichtheimia corymbifera,B.S. Mehrotra,MycoBank
17,Abortiporus biennis,Absidia heterospora,Y. Ling,ICTF


In [467]:
df.drop_duplicates(subset='taxon_name', keep='last')

Unnamed: 0,taxon_name,current_taxon_name,authors,source
1,Polyporus biennis,Abortiporus biennis,(Bulliard) Fries,MycoBank
3,Absidia anomala,,H. Naganishi & Hirahara,IndexFungorum
4,Absidia blakesleeana,Lichtheimia hyalospora,Lendner,MycoBank
...,...,...,...,...
15,Absidia griseola,Absidia griseola,H. Naganishi & Hirahara,MycoBank
16,Absidia hesseltinei,Lichtheimia corymbifera,B.S. Mehrotra,MycoBank
17,Abortiporus biennis,Absidia heterospora,Y. Ling,ICTF


### Drop duplicates 2

Drop rows where the value in one column is duplicated

In [6]:
df_dup = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
df_dup

Unnamed: 0,Column1,Column2,Column3
0,'cat','bat','xyz'
1,'toy','flower','abc'
2,'cat','bat','lmn'


In [7]:
df_dup.drop_duplicates(['Column1'], keep='first')

Unnamed: 0,Column1,Column2,Column3
0,'cat','bat','xyz'
1,'toy','flower','abc'


In [8]:
# OR, based on 2 columns
df_dup.drop_duplicates(['Column1', 'Column2'], keep='first')

Unnamed: 0,Column1,Column2,Column3
0,'cat','bat','xyz'
1,'toy','flower','abc'


## Load data from json query

#### Read json from URL, way 1
This is brittle, since it's using local solr url as an example. 
TODO: find an accessible json qry

In [468]:
query_url = 'http://localhost:8983/solr/CFIA_all/select?fl=id&q=title:grain'

In [469]:
'''
df2 = pd.read_json(query_url)
df2.head()
'''

'\ndf2 = pd.read_json(query_url)\ndf2.head()\n'

#### Read json from URL, way 1 (with requests)

In [470]:
import requests
'''
query_url = 'http://localhost:8983/solr/CFIA_all/select?fl=id&q=title:grain'
r = requests.get(query_url)
query_response_df = pd.DataFrame(r.json()['response']['docs'])
query_response_df.head()
'''

"\nquery_url = 'http://localhost:8983/solr/CFIA_all/select?fl=id&q=title:grain'\nr = requests.get(query_url)\nquery_response_df = pd.DataFrame(r.json()['response']['docs'])\nquery_response_df.head()\n"

#### Read json from url, way 2 (no requests)
Note that this method uses only pandas, not requests or any other library. The downside is that it reads into the dataframe exactly what the URL returned, headers and all. If you need to pre-process the json first, the requests way above is the better option.

In [471]:
'''
df2 = pd.read_json(query_url)
df2
'''

'\ndf2 = pd.read_json(query_url)\ndf2\n'
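
Since the Solr URL above is only reachable locally, the pre-processing step can be sketched offline with an in-memory payload. The `payload` dict below is hypothetical; it only mimics the `response` / `docs` nesting used in the requests example.

```python
import pandas as pd

# Hypothetical payload shaped like the json used in the requests example
payload = {
    "responseHeader": {"status": 0},
    "response": {"numFound": 2, "docs": [{"id": "doc-1"}, {"id": "doc-2"}]},
}

# Keep only the list of documents and turn it into a df
docs_df = pd.json_normalize(payload["response"]["docs"])

print(docs_df["id"].tolist())  # ['doc-1', 'doc-2']
```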

## Write data to an excel file

In [None]:
# The context manager saves and closes the file on exit
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('/path/to/file/dataframe.xlsx') as writer:
    df.to_excel(excel_writer=writer, sheet_name='test_sheet')