In this lecture, we study how to convert other types of objects into pandas 'DataFrame' objects. Our main focus will be on lists, tuples, dictionaries, and NumPy arrays. These conversions come up often in ETL (extract, transform, load) processes in business operations.

We first look at how to convert a list into a 'Series' object and a 'DataFrame' object. Remember that a 'Series' object can have axis labels (instead of just axis positions). We start with lists because, of all the types covered here, lists convert to 'DataFrame' objects most directly.

In [1]:
import numpy as np
import pandas as pd

First, we convert a single list into a 'Series' object and then into a single-column 'DataFrame' object.

In [2]:
list1=[13,53,42.9,43.8,0]
series1=pd.Series(data=list1)
print(series1, type(series1))
df=pd.DataFrame(series1, columns=['list1'])
df

0    13.0
1    53.0
2    42.9
3    43.8
4     0.0
dtype: float64 <class 'pandas.core.series.Series'>


   list1
0   13.0
1   53.0
2   42.9
3   43.8
4    0.0
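Because a 'Series' can carry axis labels, we can also supply our own labels and a name when converting a list. The sketch below is only an illustration: the labels 'a' through 'e' and the names labeled and df_labeled are made up for this example and are not part of the lecture code.

```python
import pandas as pd

list1 = [13, 53, 42.9, 43.8, 0]

# Hypothetical labels, chosen only for illustration
labels = ['a', 'b', 'c', 'd', 'e']
labeled = pd.Series(data=list1, index=labels, name='list1')

# Values can now be looked up by label as well as by position
print(labeled['c'])      # lookup by label
print(labeled.iloc[2])   # lookup by position

# to_frame() turns the named Series into a one-column DataFrame;
# the Series name becomes the column name
df_labeled = labeled.to_frame()
print(df_labeled)
```

Naming the 'Series' up front means we do not need to pass columns= when building the 'DataFrame'.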


If we have additional lists, we can zip them together so that each list becomes a column:

In [3]:
list2=[12,543,5,0,0.4]
data=list(zip(list1, list2))
df=pd.DataFrame(data, columns=['l1', 'l2'])
df

     l1     l2
0  13.0   12.0
1  53.0  543.0
2  42.9    5.0
3  43.8    0.0
4   0.0    0.4


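Alternatively, either of the two methods below works as well. The first assigns each list to a column of an initially empty 'DataFrame'; the second concatenates 'Series' objects along axis=1: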

In [4]:
df_a = pd.DataFrame() #creates a new dataframe that's empty
df_a['l1']=list1
df_a['l2']=list2
print('df:\n', df_a)

series2=pd.Series(data=list2)
df_b=pd.concat([series1,series2],axis=1)
df_b.columns=['l1','l2']
print('df:\n', df_b)

df:
      l1     l2
0  13.0   12.0
1  53.0  543.0
2  42.9    5.0
3  43.8    0.0
4   0.0    0.4
df:
      l1     l2
0  13.0   12.0
1  53.0  543.0
2  42.9    5.0
3  43.8    0.0
4   0.0    0.4
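All of the approaches so far assume the lists have the same length. As a hedged side note, zip() silently stops at the shortest list, so data can be lost without any warning. The sketch below uses two made-up lists (short and long_) to show the difference between zip() and itertools.zip_longest(); padding with NaN is one possible choice, not the only one.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Hypothetical lists of unequal length
short = [1, 2, 3]
long_ = [10, 20, 30, 40]

# zip() stops at the shortest list, so the value 40 is dropped
df_zip = pd.DataFrame(list(zip(short, long_)), columns=['l1', 'l2'])

# zip_longest() pads the shorter list; here we pad with NaN
padded = list(zip_longest(short, long_, fillvalue=np.nan))
df_pad = pd.DataFrame(padded, columns=['l1', 'l2'])

print(df_zip.shape, df_pad.shape)
print(df_pad)
```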


Now let's make the problem a bit harder. Suppose we have a list that contains lists, and we want to make a dataset out of it. There are two ways of doing this: each inner list becomes either a row or a column of the 'DataFrame'. We show both ways below:

In [5]:
list3=[['a','b','c','d','e'], [1,2,3,4,5], [0.4,23,5,67,76.9034]]
df1=pd.DataFrame(list3, columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df1

  Col1 Col2 Col3 Col4     Col5
0    a    b    c    d        e
1    1    2    3    4        5
2  0.4   23    5   67  76.9034


In [6]:
df2=pd.DataFrame(list3).transpose()
df2.columns=['C1', 'C2', 'C3']
df2

  C1 C2       C3
0  a  1      0.4
1  b  2     23.0
2  c  3      5.0
3  d  4     67.0
4  e  5  76.9034
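One thing to watch with the transpose approach is the column dtypes: since every column of the row-wise 'DataFrame' mixes text and numbers, the transposed columns are stored as generic objects rather than as numbers. The sketch below checks this and shows one possible alternative, pairing each column name with its inner list through dict(zip(...)). The names df_t and df_cols are introduced here only for illustration.

```python
import pandas as pd

list3 = [['a', 'b', 'c', 'd', 'e'], [1, 2, 3, 4, 5], [0.4, 23, 5, 67, 76.9034]]
names = ['C1', 'C2', 'C3']

# Row-wise construction followed by transpose, as in the cell above
df_t = pd.DataFrame(list3).transpose()
df_t.columns = names
print(df_t.dtypes)     # the numeric columns are stored with the object dtype

# Pairing each name with its list builds the columns directly,
# so pandas can infer a numeric dtype for C2 and C3
df_cols = pd.DataFrame(dict(zip(names, list3)))
print(df_cols.dtypes)
```

If the transposed version is already in hand, calling infer_objects() on it is another way to recover numeric dtypes.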


To convert a column in a 'DataFrame' back to a list, simply use the tolist() method:

In [7]:
df2['C1'].tolist()

['a', 'b', 'c', 'd', 'e']
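The same idea extends to a whole 'DataFrame'. The sketch below, using a small made-up frame called df_small, shows one way to get a list per column and one way to get a list per row; other routes exist, so treat this as an illustration rather than the only method.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration
df_small = pd.DataFrame({'C1': ['a', 'b'], 'C2': [1, 2]})

# One list per column: call tolist() on each column in turn
as_columns = [df_small[c].tolist() for c in df_small.columns]

# One list per row: go through the underlying array
as_rows = df_small.to_numpy().tolist()

print(as_columns)
print(as_rows)
```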

Now let's talk about tuple conversion. Building datasets out of tuples works much like building them out of lists.

In [8]:
tup1=(1.5,2.336,4.6,9.054,7.342)
tup2=('a','b','c','d','e')
Series1=pd.Series(data=tup1) # the pd.Series() constructor accepts tuples as well as lists
Series2=pd.Series(data=tup2)
df=pd.concat([Series1,Series2], axis=1)
df.columns=['COL1','COL2']
print(df)

data=[tup1,tup2]
df=pd.DataFrame(data).transpose()
df.columns=['tup1', 'tup2']
df

    COL1 COL2
0  1.500    a
1  2.336    b
2  4.600    c
3  9.054    d
4  7.342    e


    tup1 tup2
0    1.5    a
1  2.336    b
2    4.6    c
3  9.054    d
4  7.342    e
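In practice, tuples often arrive as records, that is, one tuple per row rather than one tuple per column. The sketch below shows that case, and also shows that a list of namedtuples supplies the column names through its field names. The Reading type and the column names 'value' and 'label' are hypothetical and chosen only for this example.

```python
from collections import namedtuple

import pandas as pd

# Each tuple is one record (row) rather than one column
records = [(1.5, 'a'), (2.336, 'b'), (4.6, 'c')]
df_rows = pd.DataFrame(records, columns=['value', 'label'])

# With namedtuples, pandas takes the column names from the field names
Reading = namedtuple('Reading', ['value', 'label'])  # hypothetical record type
df_named = pd.DataFrame([Reading(1.5, 'a'), Reading(2.336, 'b')])

print(df_rows)
print(df_named)
```

No transpose is needed here because the data is already organized row by row.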


Now let's work with dictionaries. Dictionaries are more complicated. Depending on what the dictionary looks like and what the final target dataset should look like, there are a variety of ways of handling them in Python. One simple way is to convert the dictionary into lists and apply what we have learned so far. For example:

In [9]:
d1={'k1':3,'k2':34.3,'k3':54.92}
print('The dictionary d1 looks like this: ', d1)
list_keys=list(d1.keys())     # keys as a list
print(list_keys)
list_values=list(d1.values()) # values as a list
print(list_values)
list_items=list(d1.items())   # (key, value) tuples as a list
print(list_items)

df3=pd.DataFrame(list_items, columns=['var1', 'var2'])
print('df3:\n',df3)
df4=pd.DataFrame(list_items).transpose()
df4.columns=['pair1','pair2','pair3']
print('df4:\n',df4)

The dictionary d1 looks like this:  {'k1': 3, 'k2': 34.3, 'k3': 54.92}
['k1', 'k2', 'k3']
[3, 34.3, 54.92]
[('k1', 3), ('k2', 34.3), ('k3', 54.92)]
df3:
   var1   var2
0   k1   3.00
1   k2  34.30
2   k3  54.92
df4:
   pair1 pair2  pair3
0    k1    k2     k3
1     3  34.3  54.92
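For a flat dictionary like d1, we can also skip the intermediate lists. The sketch below is one possible shortcut: a 'Series' built from a dictionary uses the keys as its index, and wrapping the dictionary in a list gives a one-row 'DataFrame'. The column names 'key' and 'value' and the variable names are made up for this illustration.

```python
import pandas as pd

d1 = {'k1': 3, 'k2': 34.3, 'k3': 54.92}

# A Series built from a dictionary uses the keys as its index
s = pd.Series(d1)

# Two-column layout: one row per key/value pair
df_long = s.rename_axis('key').reset_index(name='value')

# One-row layout: wrap the dictionary in a list, keys become columns
df_wide = pd.DataFrame([d1])

print(df_long)
print(df_wide)
```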


The most common scenario is this: we have a dictionary whose values are lists, and we want to use the keys as column names and the values as the column data. Here is how we can achieve it:

In [10]:
d2={'k1': list1, 'k2': list2}
print(d2)
df5=pd.DataFrame(d2)
print(df5)

{'k1': [13, 53, 42.9, 43.8, 0], 'k2': [12, 543, 5, 0, 0.4]}
     k1     k2
0  13.0   12.0
1  53.0  543.0
2  42.9    5.0
3  43.8    0.0
4   0.0    0.4
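If we would rather have the keys as row labels instead of column names, one option is DataFrame.from_dict() with orient='index'. The sketch below rebuilds d2 so that it runs on its own; df_by_row is a name introduced only for this example.

```python
import pandas as pd

list1 = [13, 53, 42.9, 43.8, 0]
list2 = [12, 543, 5, 0, 0.4]
d2 = {'k1': list1, 'k2': list2}

# orient='index' makes each key a row label instead of a column name
df_by_row = pd.DataFrame.from_dict(d2, orient='index')
print(df_by_row)

# Transposing the default (column-oriented) result gives the same shape
print(pd.DataFrame(d2).T.shape)
```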


If we want to convert a given 'DataFrame' back to dictionary form, we can do the following:

In [11]:
dic_from_df5=df5.to_dict()
print(dic_from_df5)

{'k1': {0: 13.0, 1: 53.0, 2: 42.9, 3: 43.8, 4: 0.0}, 'k2': {0: 12.0, 1: 543.0, 2: 5.0, 3: 0.0, 4: 0.4}}
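Notice that the default output nests the row index inside each column. The orient argument of to_dict() changes that layout. The sketch below shows two of the available options on a shortened, illustrative version of df5.

```python
import pandas as pd

# Shortened version of df5, for illustration only
df5 = pd.DataFrame({'k1': [13, 53, 42.9], 'k2': [12, 543, 5]})

# orient='list' drops the index and maps each column name to a plain list
as_lists = df5.to_dict(orient='list')

# orient='records' returns one dictionary per row
as_records = df5.to_dict(orient='records')

print(as_lists)
print(as_records)
```

The 'list' layout is the one that round-trips with pd.DataFrame(d2) above.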


When we have multiple dictionaries to convert into a single 'DataFrame' object, here is what we can do (assuming that, after merging, every key ends up with the same number of values):

In [12]:
d1={'k1': [2.5, 3.8, 4, 7.9], 'k2': [2, 6, 0, 9.4]}
d2={'k1': [0.3,3.5,4.5], 'k2': [2.345,6.103,9.863]}

def merge_dicts(dic1,dic2):
    from collections import defaultdict
    dd = defaultdict(list)
    for d in (dic1, dic2):
        for key, value in d.items():
            dd[key].extend(value)
    data=dict(dd) # turning the defaultdict back to a dictionary object
    return data
combined_d=merge_dicts(d1,d2)

df6=pd.DataFrame(combined_d)
df6

    k1     k2
0  2.5    2.0
1  3.8    6.0
2  4.0    0.0
3  7.9    9.4
4  0.3  2.345
5  3.5  6.103
6  4.5  9.863
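A possible alternative to merging the dictionaries by hand is to build one 'DataFrame' per dictionary and stack them with pd.concat(). The sketch below assumes both dictionaries share the same keys, as they do here; df_stacked is a name introduced only for this example.

```python
import pandas as pd

d1 = {'k1': [2.5, 3.8, 4, 7.9], 'k2': [2, 6, 0, 9.4]}
d2 = {'k1': [0.3, 3.5, 4.5], 'k2': [2.345, 6.103, 9.863]}

# Build one DataFrame per dictionary, then stack them vertically;
# ignore_index=True renumbers the rows from 0 to n-1
df_stacked = pd.concat([pd.DataFrame(d1), pd.DataFrame(d2)], ignore_index=True)
print(df_stacked)
```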


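Here is the wrong way. Passing the two dictionaries as a list makes each dictionary a single row, so every cell ends up holding an entire list: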

In [13]:
d1={'k1': [2.5, 3.8, 4, 7.9], 'k2': [2, 6, 0, 9.4]}
d2={'k1': [0.3,3.5,4.5], 'k2': [2.345,6.103,9.863]}
df6=pd.DataFrame([d1,d2])
df6

                   k1                     k2
0  [2.5, 3.8, 4, 7.9]         [2, 6, 0, 9.4]
1     [0.3, 3.5, 4.5]  [2.345, 6.103, 9.863]
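If we already ended up with list-valued cells like these, one way to recover is the explode() method. This is a sketch under two assumptions: the installed pandas is version 1.3 or later (needed to explode several columns at once), and within each row the lists have equal lengths. The names nested and flat are made up for this example.

```python
import pandas as pd

d1 = {'k1': [2.5, 3.8, 4, 7.9], 'k2': [2, 6, 0, 9.4]}
d2 = {'k1': [0.3, 3.5, 4.5], 'k2': [2.345, 6.103, 9.863]}

nested = pd.DataFrame([d1, d2])   # each cell holds a whole list

# explode() unpacks the lists into separate rows; within a row, the
# lists in the exploded columns must have the same length
flat = nested.explode(['k1', 'k2'], ignore_index=True)

# The exploded columns keep the object dtype, so convert explicitly
flat = flat.astype(float)
print(flat)
```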


Lastly, we talk about converting between NumPy 'ndarray' objects and pandas 'DataFrame' objects. First, we transform an array into a 'DataFrame', and then we go the other way, from the 'DataFrame' back to an array:

In [14]:
array=np.array([[1,2,3],[3.5,4.7,0.9]], float)
df7=pd.DataFrame(data=array, columns=['col1', 'col2', 'col3'])
print(array, '\n')
print(df7, '\n')

back_to_array=df7.values
print('Back to array: \n', back_to_array)

[[1.  2.  3. ]
 [3.5 4.7 0.9]] 

   col1  col2  col3
0   1.0   2.0   3.0
1   3.5   4.7   0.9 

Back to array: 
 [[1.  2.  3. ]
 [3.5 4.7 0.9]]
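The round trip above is clean because every column is a float. As a hedged illustration of what happens otherwise, the sketch below uses to_numpy(), the accessor the pandas documentation recommends over .values, on two small made-up frames. Since an array has a single dtype, pandas picks one that can hold every column.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration
df_num = pd.DataFrame({'col1': [1, 2], 'col2': [3.5, 4.7]})
df_mixed = pd.DataFrame({'col1': [1, 2], 'col2': ['x', 'y']})

# Integer and float columns are upcast to a common float dtype
arr_num = df_num.to_numpy()
print(arr_num.dtype)

# A text column forces the whole array to the generic object dtype
arr_mixed = df_mixed.to_numpy()
print(arr_mixed.dtype)

# Selecting the numeric columns first keeps the result numeric
arr_sel = df_mixed[['col1']].to_numpy()
print(arr_sel.dtype)
```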
