# DataFrames

In [10]:
import pandas as pd
import numpy as np


- Create a DataFrame with these details:
- **np.random.randn(5,4)** as the first argument
- **index = ['A','B','C','D','E']**
- **columns = ['F','G','H','I']** (one label per column of the 5x4 array; the cells below use a smaller 3x3 example)

In [11]:
np.random.randn(3,3)

array([[-1.00859008, -0.94515347,  1.46605728],
       [ 1.0107793 ,  0.65685865, -0.52002778],
       [ 0.57583091,  0.78846963, -0.48586689]])

In [12]:
df = pd.DataFrame(np.random.randn(3,3), index=['A','B','C'], columns=['F','G','H'])
df

Unnamed: 0,F,G,H
A,0.665844,0.689546,0.002671
B,-1.494107,0.467895,0.455264
C,0.08984,0.21454,1.158228
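The 3x3 cell above can be scaled up to the 5x4 case described in the exercise. This is a minimal sketch; the name `df_full` is made up for illustration, and it assumes four column labels so that the labels match the four columns of the array.

```python
import numpy as np
import pandas as pd

# 5 rows and 4 columns of standard-normal random values
values = np.random.randn(5, 4)

# The number of index labels must equal the number of rows (5),
# and the number of column labels must equal the number of columns (4).
df_full = pd.DataFrame(values,
                       index=['A', 'B', 'C', 'D', 'E'],
                       columns=['F', 'G', 'H', 'I'])
print(df_full.shape)
```

Passing a label list whose length does not match the array shape raises a `ValueError`.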


- Create a dictionary of equal-length lists and pass it to a DataFrame
- data = { "movieid" : [1,2,3,4,5,6,7,8], "moviename": ["Toy Story (1995)", "Jumanji (1995)","Grumpier Old Men (1995)","Waiting to Exhale (1995)","Father of the Bride Part II (1995)","Heat (1995)","Sabrina (1995)","Tom and Huck (1995)"]}

In [13]:
data = { "movieid" : [1,2,3,4,5,6,7,8], 
        "moviename": ["Toy Story (1995)", "Jumanji (1995)",
                      "Grumpier Old Men (1995)",
                      "Waiting to Exhale (1995)",
                      "Father of the Bride Part II (1995)",
                      "Heat (1995)","Sabrina (1995)",
                      "Tom and Huck (1995)"]}
df = pd.DataFrame(data)
df

Unnamed: 0,movieid,moviename
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)
5,6,Heat (1995)
6,7,Sabrina (1995)
7,8,Tom and Huck (1995)


- Add an extra column by listing it in the **columns** parameter of the constructor; a column name with no data is filled with NaN
- Values can be assigned to the newly created column using scalar assignment: **df['rating'] = 0**
- We can also assign a Series with a labelled index to the new column; pandas aligns on the index and updates only the matching rows

In [14]:
df['rating']=0
df

Unnamed: 0,movieid,moviename,rating
0,1,Toy Story (1995),0
1,2,Jumanji (1995),0
2,3,Grumpier Old Men (1995),0
3,4,Waiting to Exhale (1995),0
4,5,Father of the Bride Part II (1995),0
5,6,Heat (1995),0
6,7,Sabrina (1995),0
7,8,Tom and Huck (1995),0
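The scalar assignment above fills every row. The index-aligned Series assignment mentioned in the bullets might look like the sketch below; the rating values and the names `movies` and `ratings` are invented for illustration.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieid": [1, 2, 3],
    "moviename": ["Toy Story (1995)", "Jumanji (1995)",
                  "Grumpier Old Men (1995)"],
})

# Hypothetical ratings for rows 0 and 2 only; row 1 has no entry.
ratings = pd.Series([4.5, 3.0], index=[0, 2])

# Assignment aligns on the index: rows 0 and 2 get their values,
# and row 1 becomes NaN because the Series has no label 1.
movies["rating"] = ratings
print(movies)
```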


- Name the columns and the index (rows), just as we did for Series: **df.index.name = "XYZ"** and **df.columns.name = "ABC"**

In [15]:
df.index.name = 'slno'
df.columns.name = 'Movie Data'
df

Movie Data,movieid,moviename,rating
slno,,,
0,1,Toy Story (1995),0
1,2,Jumanji (1995),0
2,3,Grumpier Old Men (1995),0
3,4,Waiting to Exhale (1995),0
4,5,Father of the Bride Part II (1995),0
5,6,Heat (1995),0
6,7,Sabrina (1995),0
7,8,Tom and Huck (1995),0


## Selection and Indexing

- Various methods to grab data from a DataFrame
- Column names can be used like dictionary keys
- Use .loc to access a row by its label

- Get just the underlying values with df.values
- Check the datatype

In [16]:
df['rating']
row0 = df.loc[0]
row0.index

Index(['movieid', 'moviename', 'rating'], dtype='object', name='Movie Data')

----
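- As a dictionary key: **df['rating']**
- As attribute (dot) notation: **df.rating** (less common; it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)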

In [17]:
df['rating']

slno
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
Name: rating, dtype: int64

- It is also possible to fetch a row by label with df.loc['A'] or by position with df.iloc[0]

In [18]:
df = pd.DataFrame(np.random.randn(3,3), index=['A','B','C'], columns=['F','G','H'])
print (df.loc['A'])
print (df.iloc[0])


F    0.293230
G    0.025665
H   -0.826068
Name: A, dtype: float64
F    0.293230
G    0.025665
H   -0.826068
Name: A, dtype: float64


- We can also pass a list of column names: **df[['F','G']]**
- Also compare the data types returned when selecting a single column and when selecting multiple columns
- When it's a Series, check the name attribute too!

In [19]:
df[['F','G']]

Unnamed: 0,F,G
A,0.29323,0.025665
B,0.398519,-0.209922
C,0.848919,-1.277502
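The type checks suggested in the bullets can be sketched as follows. The frame `demo` is a stand-in built the same way as `df` in the cells above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.randn(3, 3),
                    index=['A', 'B', 'C'], columns=['F', 'G', 'H'])

single = demo['F']          # one column name -> Series
multi = demo[['F', 'G']]    # list of column names -> DataFrame

print(type(single).__name__)   # Series
print(type(multi).__name__)    # DataFrame
print(single.name)             # F (the Series keeps the column name)

# .values exposes the underlying data as a NumPy array
print(type(demo.values).__name__, demo.values.shape)
```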


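- Creating a new column through computation
- **df['new'] = df['F'] + df['G']** adds two columns row by row
- The cell below instead assigns a scalar, the mean of column H, which is broadcast to every row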

In [20]:
df['handg'] = df['H'].mean()
df

Unnamed: 0,F,G,H,handg
A,0.29323,0.025665,-0.826068,0.356665
B,0.398519,-0.209922,1.631301,0.356665
C,0.848919,-1.277502,0.264761,0.356665
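For contrast with the scalar mean above, here is a small sketch of both kinds of computed column. The fixed numbers and the column names `f_plus_g` and `h_mean` are chosen for illustration only.

```python
import pandas as pd

demo = pd.DataFrame({'F': [1.0, 2.0, 3.0],
                     'G': [10.0, 20.0, 30.0],
                     'H': [0.0, 3.0, 6.0]},
                    index=['A', 'B', 'C'])

# Element-wise: each row gets the sum of its own F and G values.
demo['f_plus_g'] = demo['F'] + demo['G']

# Scalar: the single mean of H is repeated in every row.
demo['h_mean'] = demo['H'].mean()
print(demo)
```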


- Dropping rows and columns
- Rows can be dropped by index label
- Columns can be dropped by giving the column name and axis=1
- drop returns a new DataFrame by default; pass **inplace=True** to change the underlying DataFrame instead

In [21]:
print (df)
print (df.drop('A'))
print (df)

          F         G         H     handg
A  0.293230  0.025665 -0.826068  0.356665
B  0.398519 -0.209922  1.631301  0.356665
C  0.848919 -1.277502  0.264761  0.356665
          F         G         H     handg
B  0.398519 -0.209922  1.631301  0.356665
C  0.848919 -1.277502  0.264761  0.356665
          F         G         H     handg
A  0.293230  0.025665 -0.826068  0.356665
B  0.398519 -0.209922  1.631301  0.356665
C  0.848919 -1.277502  0.264761  0.356665
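The output above shows that dropping a row leaves the original untouched. Dropping a column and using `inplace=True` could look like this sketch, which uses a small stand-in frame named `demo` with made-up values.

```python
import pandas as pd

demo = pd.DataFrame({'F': [1, 2, 3], 'G': [4, 5, 6], 'H': [7, 8, 9]},
                    index=['A', 'B', 'C'])

# axis=1 tells drop to look for the label among the columns.
without_g = demo.drop('G', axis=1)
print(without_g.columns.tolist())   # demo itself still has column G

# inplace=True modifies demo directly and returns None.
demo.drop('A', inplace=True)
print(demo.index.tolist())
```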


- Selecting a subset of rows and columns
- df.loc[['A','B'],['F','H']] returns a smaller DataFrame
- df.loc['A','F'] returns a single value
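
A minimal sketch of both forms of `.loc` selection, assuming a frame with the F, G, H columns used earlier; the values and the names `demo`, `block` and `cell` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'F': [1, 2, 3], 'G': [4, 5, 6], 'H': [7, 8, 9]},
                    index=['A', 'B', 'C'])

# Lists of row labels and column labels -> a smaller DataFrame
block = demo.loc[['A', 'B'], ['F', 'H']]
print(block)

# A single row label and a single column label -> one scalar value
cell = demo.loc['A', 'F']
print(cell)
```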

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:
- df > 0
- df[df > 0]
- df[df['F'] > 0]
- df[df['F'] > 0]['H']
- df[(df['F'] > 0) & (df['H'] > 1)]

To combine two conditions, use | (or) and & (and), with each condition wrapped in parentheses.
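
The selections listed above can be tried on a small frame with known values. This is a sketch; `demo` and its numbers are made up so the results are easy to follow.

```python
import pandas as pd

demo = pd.DataFrame({'F': [1.0, -2.0, 3.0],
                     'G': [0.5, 1.5, -2.5],
                     'H': [-1.0, 2.0, 4.0]},
                    index=['A', 'B', 'C'])

mask = demo > 0                   # DataFrame of booleans
masked = demo[demo > 0]           # same shape, NaN where the mask is False
rows = demo[demo['F'] > 0]        # keeps only rows where F is positive
h_of_rows = demo[demo['F'] > 0]['H']   # column H of those rows

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators in Python.
both = demo[(demo['F'] > 0) & (demo['H'] > 1)]
either = demo[(demo['F'] < 0) | (demo['H'] < 0)]

print(rows.index.tolist(), both.index.tolist(), either.index.tolist())
```

Python's `and` and `or` keywords do not work here, because they try to reduce a whole Series to one truth value and raise a `ValueError`.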

## Working with Index


In [22]:
df

Unnamed: 0,F,G,H,handg
A,0.29323,0.025665,-0.826068,0.356665
B,0.398519,-0.209922,1.631301,0.356665
C,0.848919,-1.277502,0.264761,0.356665


- Reset the index using reset_index(); it returns a new DataFrame, so df itself is unchanged unless the result is assigned back

In [23]:
df.reset_index()
df

Unnamed: 0,F,G,H,handg
A,0.29323,0.025665,-0.826068,0.356665
B,0.398519,-0.209922,1.631301,0.356665
C,0.848919,-1.277502,0.264761,0.356665
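The output above is unchanged because the result of `reset_index()` was not kept. A short sketch of capturing it, using a stand-in frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'F': [1, 2, 3]}, index=['A', 'B', 'C'])

# reset_index returns a new frame: the old labels move into a column
# named 'index' and the rows get a default 0..n-1 integer index.
flat = demo.reset_index()
print(flat)

# The original frame keeps its labels.
print(demo.index.tolist())
```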


- Attach a new index
- Create a list of index labels
- Add the list as a column, say 'city'
- Use set_index('city') to make that column the index (the cell below does the same with the existing 'movieid' column)

In [24]:
data = { "movieid" : [1,2,3,4,5,6,7,8], 
        "moviename": ["Toy Story (1995)", "Jumanji (1995)",
                      "Grumpier Old Men (1995)",
                      "Waiting to Exhale (1995)",
                      "Father of the Bride Part II (1995)",
                      "Heat (1995)","Sabrina (1995)",
                      "Tom and Huck (1995)"]}
df = pd.DataFrame(data)
a = df.set_index('movieid')
a.loc[2]

moviename    Jumanji (1995)
Name: 2, dtype: object
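
The 'city' steps from the bullets might look like the sketch below. The city names and the frame `demo` are hypothetical, chosen only to illustrate the pattern.

```python
import pandas as pd

demo = pd.DataFrame({'F': [1, 2, 3]}, index=['A', 'B', 'C'])

# Hypothetical labels, one per row
demo['city'] = ['Chennai', 'Mumbai', 'Delhi']

# set_index returns a new frame indexed by 'city'; the column is
# moved out of the data and into the index.
by_city = demo.set_index('city')
print(by_city.loc['Mumbai'])
```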

- reindex conforms the data to a new set of row or column labels: existing labels are rearranged, and labels that are not already present are filled with missing values (NaN)

In [25]:
df.reindex(columns=['moviename', 'movieid'])

Unnamed: 0,moviename,movieid
0,Toy Story (1995),1
1,Jumanji (1995),2
2,Grumpier Old Men (1995),3
3,Waiting to Exhale (1995),4
4,Father of the Bride Part II (1995),5
5,Heat (1995),6
6,Sabrina (1995),7
7,Tom and Huck (1995),8
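
The cell above only reorders existing columns. A sketch of the missing-data behaviour on rows, with a made-up frame `demo` and a label 'Z' that does not exist in it:

```python
import pandas as pd

demo = pd.DataFrame({'F': [1.0, 2.0, 3.0]}, index=['A', 'B', 'C'])

# 'C' and 'A' are reordered; 'Z' is new, so its row is filled with NaN.
# 'B' is not requested, so it is left out of the result.
reordered = demo.reindex(['C', 'A', 'Z'])
print(reordered)

# fill_value replaces the default NaN for labels that were missing.
filled = demo.reindex(['C', 'A', 'Z'], fill_value=0.0)
print(filled)
```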


In [26]:
import json
# A notebook file is itself JSON; the with-block closes the file after loading
with open('json1.ipynb') as json1:
    db1 = json.load(json1)

In [27]:
db1

{'cells': [{'cell_type': 'markdown',
   'metadata': {},
   'source': ['# Json to a flat DF\n',
    '\n',
    '- In this exercise let us process food_data.json and tries to unravel a nested json into a flat DF to be processed further']},
  {'cell_type': 'code',
   'execution_count': 10,
   'metadata': {'collapsed': True},
   'outputs': [],
   'source': ['import numpy as np\n', 'import pandas as pd']},
  {'cell_type': 'code',
   'execution_count': 11,
   'metadata': {'collapsed': True},
   'outputs': [],
   'source': ['import json']},
  {'cell_type': 'markdown',
   'metadata': {},
   'source': ['- load the json using json.load']},
  {'cell_type': 'code',
   'execution_count': 66,
   'metadata': {'collapsed': True, 'scrolled': True},
   'outputs': [],
   'source': ["db = json.load(open('food_data.json'))"]},
  {'cell_type': 'markdown',
   'metadata': {},
   'source': ['- check the following\n',
    '- number of data items in the dictionary\n',
    '- keys of the dictionary']},
  {'cell_ty