# How to get/create data in Python

## Section

1)[Creating Dataframe With Series](#Creating-Dataframe-With-Series)<br>
2)[Creating Dataframe With List](#Creating-Dataframe-With-List)<br>
3)[Creating Dataframe With Dict](#Creating-Dataframe-With-Dict)<br>
4)[Numpy random.randn](#Numpy-random.randn)<br>
5)[Numpy random.randint](#Numpy-random.randint)<br>
6)[Adding a New Column](#Adding-a-New-Column)<br>
7)[Adding a New Row Without Index](#Adding-a-New-Row-Without-Index)<br>
8)[Adding a New Row With Index](#Add-a-New-Row-With-Index)<br>
9)[Merging Two Dataframes](#Merging-Two-Dataframes)<br>
10)[Merging Two Columns](#Merging-Two-Columns)<br>
11)[Deleting Columns](#Deleting-Columns)<br>
12)[Deleting Rows](#Deleting-Rows)<br>

## Creating Dataframe With Series

<br>[Top](#Section)

In [64]:
import pandas as pd
import numpy as np
import random

author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti'] 
article = [210, 211, 114, 178] 
auth_series = pd.Series(author) 
article_series = pd.Series(article) 
  
frame = { 'Author': auth_series, 'Article': article_series } 
  
result = pd.DataFrame(frame)
result

Unnamed: 0,Author,Article
0,Jitender,210
1,Purnima,211
2,Arpit,114
3,Jyoti,178


## Creating Dataframe With List

<br>[Top](#Section)

In [171]:
#each inner list is one row; size=4 and rand(4) give 1-D arrays, so int() receives plain scalars
df = pd.DataFrame([[np.random.rand(), 2, np.random.randint(1,100), 0],
                    [3, 4, np.random.randn()*100, 1],
                   [int(l) for l in np.random.randint(100,size=4)],
                   [int(l) for l in np.random.rand(4)*100]],
                  columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,0.089816,2,41.0,0
1,3.0,4,-12.752525,1
2,66.0,37,76.0,28
3,1.0,25,37.0,4


In [150]:
#three dates, each repeated three times and sorted, used as the row index
date = pd.date_range('2015-04-01',periods=3)
date = list(date)*3
date.sort()
df_man = pd.DataFrame(np.random.randint(0,100,9), index=date, columns = ['Acct'])
df_man

Unnamed: 0,Acct
2015-04-01,95
2015-04-01,38
2015-04-01,12
2015-04-02,74
2015-04-02,33
2015-04-02,56
2015-04-03,0
2015-04-03,77
2015-04-03,86


## Creating Dataframe With Dict

<br>[Top](#Section)

In [66]:
date = pd.date_range('2015-04-01',periods=3)
date = list(date)*3
date.sort()
account = [1,2,3]*3
df_man = pd.DataFrame({'Index':date, 'Account':account,'Bal':np.random.randint(0,100,9)})
#or build the dict first and pass it to pd.DataFrame
dic = {'Index':date, 'Account':account, 'Bal':np.random.randint(0,100,9)}
pd.DataFrame(dic)

Unnamed: 0,Index,Account,Bal
0,2015-04-01,1,4
1,2015-04-01,2,56
2,2015-04-01,3,17
3,2015-04-02,1,92
4,2015-04-02,2,93
5,2015-04-02,3,37
6,2015-04-03,1,12
7,2015-04-03,2,92
8,2015-04-03,3,40
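
The dict above maps each column name to a list of values. A related layout, sketched here with hypothetical values, is a list of dicts where each dict is one row; keys missing from a row are filled with NaN.

```python
import pandas as pd

# Each dict is one row; the keys become column names (hypothetical data).
records = [
    {'Account': 1, 'Bal': 40},
    {'Account': 2, 'Bal': 56},
    {'Account': 3},  # no 'Bal' key, so this cell becomes NaN
]
df_records = pd.DataFrame(records)
print(df_records)
```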


## Numpy random.randn
The random.randn function draws samples from the standard normal distribution. Called as np.random.randn(4,4) it returns a 4x4 array, so it can be passed straight to pd.DataFrame.
<br>[Top](#Section)

In [67]:
import pandas as pd
import numpy as np
import random

dates = pd.date_range('2015-04-01',periods=4)
df = pd.DataFrame(np.random.randn(4,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2015-04-01,0.956545,-1.190426,1.212566,0.486812
2015-04-02,0.383112,-1.344361,-1.363729,0.500084
2015-04-03,2.012361,0.809969,-1.70458,0.70051
2015-04-04,-1.667865,1.539485,0.991267,-0.446862
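
The numbers above change on every run. If repeatable output is wanted, one option is a seeded generator. This is a sketch using NumPy's `default_rng`, which is not used elsewhere in this notebook; the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2015-04-01', periods=4)

# A seeded generator produces the same "random" numbers on every run.
rng = np.random.default_rng(seed=42)
df_seeded = pd.DataFrame(rng.standard_normal((4, 4)), index=dates, columns=list('ABCD'))

# A second generator with the same seed reproduces the same frame.
rng_again = np.random.default_rng(seed=42)
df_again = pd.DataFrame(rng_again.standard_normal((4, 4)), index=dates, columns=list('ABCD'))

print(df_seeded.equals(df_again))
```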


## Numpy random.randint
The random.randint function generates random integers. Passing a single number as the size (16 here) returns a 1-D array, which does not match a 4x4 dataframe, so we reshape the output to 4 rows by 4 columns.
<br>[Top](#Section)

In [68]:
dates = pd.date_range('2015-04-01',periods=4)
#df = pd.DataFrame(np.random.randint(25,100,16),index=dates,columns=list('ABCD')) <- fails: a 1-D array of 16 values does not match 4 rows x 4 columns
df = pd.DataFrame(np.random.randint(25,100,16).reshape(4,4),index=dates,columns=list('ABCD')) #<- reshape the output to 4x4
df

Unnamed: 0,A,B,C,D
2015-04-01,41,43,81,40
2015-04-02,54,81,71,72
2015-04-03,57,67,42,82
2015-04-04,39,44,35,77
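
As an alternative to reshaping, `np.random.randint` accepts a tuple for its `size` argument and returns an array of that shape directly. A short sketch, using the same bounds as the cell above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2015-04-01', periods=4)

# size=(4, 4) asks for a 4x4 array directly; the upper bound 100 is exclusive.
values = np.random.randint(25, 100, size=(4, 4))
df_int = pd.DataFrame(values, index=dates, columns=list('ABCD'))
print(df_int)
```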


#### Note:
Difference between range and arange: range is a Python built-in that returns a lazy sequence of integers (start, stop and an integer step). arange is a numpy function that returns an array and also accepts non-integer steps and a data type. For plain looping range is usually sufficient; for numerical work on the whole sequence at once, arange is generally faster because the values are stored in a numpy array.
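
A small sketch of the difference described in the note. The variable names are only for illustration.

```python
import numpy as np

# range: built-in, integers only, values are produced lazily.
r = range(0, 10, 2)

# arange: returns a numpy array; float steps and a dtype are allowed.
a = np.arange(0, 10, 2)
f = np.arange(0.0, 1.0, 0.25)

# Arithmetic applies to the whole array at once.
doubled = a * 2

print(list(r), a, f, doubled)
```

Calling `range(0, 1, 0.25)` raises a `TypeError`, because range only accepts integers.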


## Adding a New Column

Add by column name
<br>[Top](#Section)

In [69]:
df['E'] = np.random.randn(len(df)) #one value per row; len(df) keeps this correct if the row count changes
#df['Z'] = np.random.randn(len(df['A']))
df

Unnamed: 0,A,B,C,D,E
2015-04-01,41,43,81,40,-0.250708
2015-04-02,54,81,71,72,-0.327101
2015-04-03,57,67,42,82,-0.542237
2015-04-04,39,44,35,77,-0.322042


Or we can add it this way

In [70]:
df['F'] = pd.Series(np.random.randn(len(df)), index=df.index)
df.loc[:,'G'] = pd.Series(np.random.randn(len(df)), index=df.index)
#df.insert(4,"F",np.random.randn(len(df)),True) <- with allow_duplicates=True, re-running the cell inserts another column F each time; the assignments above overwrite the existing column instead
df['H']=df['A'].mean() #a scalar is broadcast to every row
df.insert(8,"I", np.random.randn(len(df))) #insert at position 8; raises ValueError if column I already exists
df

Unnamed: 0,A,B,C,D,E,F,G,H,I
2015-04-01,41,43,81,40,-0.250708,0.253455,-1.188659,47.75,-0.514058
2015-04-02,54,81,71,72,-0.327101,0.613243,-2.467766,47.75,0.475953
2015-04-03,57,67,42,82,-0.542237,-0.683589,0.380328,47.75,1.572727
2015-04-04,39,44,35,77,-0.322042,-0.817768,0.26374,47.75,-1.607333
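
All of the approaches above modify df in place. If the original frame should stay untouched, `DataFrame.assign` returns a new frame with the extra columns. A minimal sketch; `df_base` and the new column names are hypothetical, with A and B taken from the output above.

```python
import pandas as pd

df_base = pd.DataFrame({'A': [41, 54, 57, 39], 'B': [43, 81, 67, 44]})

# assign returns a new DataFrame and leaves df_base unchanged.
df_new = df_base.assign(
    A_plus_B=df_base['A'] + df_base['B'],  # column computed from other columns
    A_mean=df_base['A'].mean(),            # scalar broadcast to every row
)
print(df_new)
```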


## Adding a New Row Without Index
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so we add new rows with pd.concat instead.
Here we create a new dict, wrap it in a one-row dataframe and concatenate it to our df, ignoring the index
<br>[Top](#Section)

In [71]:
new_row = {'A':53, 'B':77, 'C':66, 'D':25, 'E':69, 'F':33, 'G':16, 'H':62, 'I':88}
df_no_index = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True) #ignore_index=True renumbers the rows 0..n-1
#new_dates = pd.date_range('2015-04-05',periods=2)
#new_dates
df_no_index
#notice how the dates are no longer there

Unnamed: 0,A,B,C,D,E,F,G,H,I
0,41,43,81,40,-0.250708,0.253455,-1.188659,47.75,-0.514058
1,54,81,71,72,-0.327101,0.613243,-2.467766,47.75,0.475953
2,57,67,42,82,-0.542237,-0.683589,0.380328,47.75,1.572727
3,39,44,35,77,-0.322042,-0.817768,0.26374,47.75,-1.607333
4,53,77,66,25,69.0,33.0,16.0,62.0,88.0
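
When several rows need to be added, it is usually cheaper to collect them first and concatenate once rather than growing the frame row by row. A sketch with hypothetical data, using `pd.concat`:

```python
import pandas as pd

df_base = pd.DataFrame({'A': [41, 54], 'B': [43, 81]})

# Collect the new rows as dicts, then build one DataFrame from them.
new_rows = [{'A': 53, 'B': 77}, {'A': 60, 'B': 12}]

# A single concat call; ignore_index=True renumbers the rows from 0.
df_more = pd.concat([df_base, pd.DataFrame(new_rows)], ignore_index=True)
print(df_more)
```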


## Add a New Row With Index

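Adding a new row while keeping the date index. The Series name becomes the row label, and pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
<br>[Top](#Section)

In [72]:
new_date = pd.date_range('2015-04-05',periods=1)
new_row = pd.Series(data = {'A':53, 'B':77, 'C':66, 'D':25, 'E':69, 'F':33, 'G':16, 'H':62, 'I':88}, name=new_date[0])
df_with_index = pd.concat([df, new_row.to_frame().T]) #to_frame().T turns the Series into a one-row dataframe labelled with its name
df_with_index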

Unnamed: 0,A,B,C,D,E,F,G,H,I
2015-04-01,41,43,81,40,-0.250708,0.253455,-1.188659,47.75,-0.514058
2015-04-02,54,81,71,72,-0.327101,0.613243,-2.467766,47.75,0.475953
2015-04-03,57,67,42,82,-0.542237,-0.683589,0.380328,47.75,1.572727
2015-04-04,39,44,35,77,-0.322042,-0.817768,0.26374,47.75,-1.607333
2015-04-05,53,77,66,25,69.0,33.0,16.0,62.0,88.0
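
For a single row, another option is to assign to a label that does not exist yet with `.loc`, which enlarges the frame in place. A sketch on a small hypothetical frame:

```python
import pandas as pd

dates = pd.date_range('2015-04-01', periods=2)
df_dates = pd.DataFrame({'A': [41, 54], 'B': [43, 81]}, index=dates)

# Assigning to a new label appends a row with that label.
df_dates.loc[pd.Timestamp('2015-04-03')] = [57, 67]
print(df_dates)
```

The values must be given in column order, one per column.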


## Merging Two Dataframes
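Create a second and third df with the same columns and stack them below the first with pd.concat. This is row-wise concatenation; a key-based join would use pd.merge instead.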
 <br>[Top](#Section)

In [73]:
#https://pandas.pydata.org/docs/user_guide/merging.html
date2 = pd.date_range('2015-04-06',periods=2)
date3 = pd.date_range('2015-04-08',periods=2)
df2 = pd.DataFrame(np.random.randn(2,9),index=date2, columns=list('ABCDEFGHI'))
df3 = pd.DataFrame(np.random.randn(2,9),index=date3, columns=list('ABCDEFGHI'))
frames = [df_with_index, df2]
results = pd.concat(frames)
results = pd.concat([results, df3], sort=False) #DataFrame.append was removed in pandas 2.0
#results = pd.concat([results, df4, df5])
results

Unnamed: 0,A,B,C,D,E,F,G,H,I
2015-04-01,41.0,43.0,81.0,40.0,-0.250708,0.253455,-1.188659,47.75,-0.514058
2015-04-02,54.0,81.0,71.0,72.0,-0.327101,0.613243,-2.467766,47.75,0.475953
2015-04-03,57.0,67.0,42.0,82.0,-0.542237,-0.683589,0.380328,47.75,1.572727
2015-04-04,39.0,44.0,35.0,77.0,-0.322042,-0.817768,0.26374,47.75,-1.607333
2015-04-05,53.0,77.0,66.0,25.0,69.0,33.0,16.0,62.0,88.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482,-0.382791,-0.429156,2.06846,-0.060432,-0.107353
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506,0.315204,0.761673,-0.168246,1.917681,-0.973591
2015-04-08,-1.164783,0.095151,0.145796,-0.989919,-0.782612,0.227995,-1.881158,-0.173856,-1.328391
2015-04-09,0.667637,1.364786,-0.828789,0.011943,-0.070555,-2.347114,1.034162,-0.901656,-0.710602
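
The cell above stacks rows. For comparison, here is a sketch of a key-based join with `pd.merge`; the two small frames and the `Owner` column are hypothetical.

```python
import pandas as pd

accounts = pd.DataFrame({'Account': [1, 2, 3], 'Owner': ['Ann', 'Bob', 'Cy']})
balances = pd.DataFrame({'Account': [1, 2, 4], 'Bal': [40, 56, 17]})

# concat stacks the rows of both frames; unmatched columns are filled with NaN.
stacked = pd.concat([accounts, balances], ignore_index=True)

# merge joins on a key column; how='inner' keeps only keys present in both.
merged = pd.merge(accounts, balances, on='Account', how='inner')
print(merged)
```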


## Merging Two Columns

Combining two columns into one, either by adding the numbers or by joining them as strings
<br>[Top](#Section)

In [74]:
results['I'] = results['G']+results['H'] #numeric sum, overwrites column I
results['J'] = results['A'].astype(str)+results['B'].astype(str) #string concatenation
#results['A+B_J'] = results[['A','B']].astype(str).agg('-'.join, axis=1)
results

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
2015-04-01,41.0,43.0,81.0,40.0,-0.250708,0.253455,-1.188659,47.75,46.561341,41.043.0
2015-04-02,54.0,81.0,71.0,72.0,-0.327101,0.613243,-2.467766,47.75,45.282234,54.081.0
2015-04-03,57.0,67.0,42.0,82.0,-0.542237,-0.683589,0.380328,47.75,48.130328,57.067.0
2015-04-04,39.0,44.0,35.0,77.0,-0.322042,-0.817768,0.26374,47.75,48.01374,39.044.0
2015-04-05,53.0,77.0,66.0,25.0,69.0,33.0,16.0,62.0,78.0,53.077.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482,-0.382791,-0.429156,2.06846,-0.060432,2.008027,0.83428511184650270.24394351667324835
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506,0.315204,0.761673,-0.168246,1.917681,1.749436,-1.4241196084264514-1.2589143748648843
2015-04-08,-1.164783,0.095151,0.145796,-0.989919,-0.782612,0.227995,-1.881158,-0.173856,-2.055014,-1.16478278611765340.09515090479963807
2015-04-09,0.667637,1.364786,-0.828789,0.011943,-0.070555,-2.347114,1.034162,-0.901656,0.132506,0.66763718357461941.364786215116632
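
Column J above shows values such as 41.043.0 because the floats are converted to strings as they are and joined without a separator. A sketch of one way to get cleaner labels, assuming the values are whole numbers so the cast to int loses nothing; the frame and the '-' separator are hypothetical choices.

```python
import pandas as pd

df_cols = pd.DataFrame({'A': [41.0, 54.0], 'B': [43.0, 81.0]})

# Cast to int first so '41.0' becomes '41', then join with a separator.
df_cols['J'] = df_cols['A'].astype(int).astype(str) + '-' + df_cols['B'].astype(int).astype(str)

# The same result with agg, which scales to more than two columns.
df_cols['J_agg'] = df_cols[['A', 'B']].astype(int).astype(str).agg('-'.join, axis=1)
print(df_cols)
```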


## Deleting Columns

The usual way to do this is with the drop function
<br>[Top](#Section)

In [75]:
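#Remember you can only run this once; a second run raises a KeyError because column J is already gone
#axis=1 is the column axis and axis=0 is the row axis; axis must be passed by keyword in pandas 2.0+
results = results.drop('J', axis=1)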
results

Unnamed: 0,A,B,C,D,E,F,G,H,I
2015-04-01,41.0,43.0,81.0,40.0,-0.250708,0.253455,-1.188659,47.75,46.561341
2015-04-02,54.0,81.0,71.0,72.0,-0.327101,0.613243,-2.467766,47.75,45.282234
2015-04-03,57.0,67.0,42.0,82.0,-0.542237,-0.683589,0.380328,47.75,48.130328
2015-04-04,39.0,44.0,35.0,77.0,-0.322042,-0.817768,0.26374,47.75,48.01374
2015-04-05,53.0,77.0,66.0,25.0,69.0,33.0,16.0,62.0,78.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482,-0.382791,-0.429156,2.06846,-0.060432,2.008027
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506,0.315204,0.761673,-0.168246,1.917681,1.749436
2015-04-08,-1.164783,0.095151,0.145796,-0.989919,-0.782612,0.227995,-1.881158,-0.173856,-2.055014
2015-04-09,0.667637,1.364786,-0.828789,0.011943,-0.070555,-2.347114,1.034162,-0.901656,0.132506


In [76]:
#Deleting a column in place (modifies results directly and returns None)
results.drop('I', axis=1, inplace=True)
results

Unnamed: 0,A,B,C,D,E,F,G,H
2015-04-01,41.0,43.0,81.0,40.0,-0.250708,0.253455,-1.188659,47.75
2015-04-02,54.0,81.0,71.0,72.0,-0.327101,0.613243,-2.467766,47.75
2015-04-03,57.0,67.0,42.0,82.0,-0.542237,-0.683589,0.380328,47.75
2015-04-04,39.0,44.0,35.0,77.0,-0.322042,-0.817768,0.26374,47.75
2015-04-05,53.0,77.0,66.0,25.0,69.0,33.0,16.0,62.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482,-0.382791,-0.429156,2.06846,-0.060432
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506,0.315204,0.761673,-0.168246,1.917681
2015-04-08,-1.164783,0.095151,0.145796,-0.989919,-0.782612,0.227995,-1.881158,-0.173856
2015-04-09,0.667637,1.364786,-0.828789,0.011943,-0.070555,-2.347114,1.034162,-0.901656


In [77]:
#Deleting by column number (positions 6 and 7 are columns G and H)
results = results.drop(results.columns[[6,7]], axis=1) #1 is column, 0 is row
results

Unnamed: 0,A,B,C,D,E,F
2015-04-01,41.0,43.0,81.0,40.0,-0.250708,0.253455
2015-04-02,54.0,81.0,71.0,72.0,-0.327101,0.613243
2015-04-03,57.0,67.0,42.0,82.0,-0.542237,-0.683589
2015-04-04,39.0,44.0,35.0,77.0,-0.322042,-0.817768
2015-04-05,53.0,77.0,66.0,25.0,69.0,33.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482,-0.382791,-0.429156
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506,0.315204,0.761673
2015-04-08,-1.164783,0.095151,0.145796,-0.989919,-0.782612,0.227995
2015-04-09,0.667637,1.364786,-0.828789,0.011943,-0.070555,-2.347114


In [78]:
#Deleting by column name
results = results.drop(['E','F'], axis=1)
results

Unnamed: 0,A,B,C,D
2015-04-01,41.0,43.0,81.0,40.0
2015-04-02,54.0,81.0,71.0,72.0
2015-04-03,57.0,67.0,42.0,82.0
2015-04-04,39.0,44.0,35.0,77.0
2015-04-05,53.0,77.0,66.0,25.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506
2015-04-08,-1.164783,0.095151,0.145796,-0.989919
2015-04-09,0.667637,1.364786,-0.828789,0.011943
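
The drop calls above fail with a KeyError when the cell is run a second time. A sketch of a re-runnable variant, using the `columns=` keyword and `errors='ignore'` on a small hypothetical frame:

```python
import pandas as pd

df_drop = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'J': ['x', 'y']})

# columns= names the axis explicitly, so no axis argument is needed.
df_drop = df_drop.drop(columns=['J'], errors='ignore')

# Running the same line again is a no-op instead of raising a KeyError.
df_drop = df_drop.drop(columns=['J'], errors='ignore')
print(df_drop)
```

The trade-off is that `errors='ignore'` also hides typos in column names, so it is best used when re-running cells rather than everywhere.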


In [79]:
#Or you can drop rows by value with a boolean filter, e.g. when several rows share the same name
#df[df.column_name != 'Tina']

## Deleting Rows

Using the drop function<br>
[Top](#Section)

In [80]:
results = results.drop([pd.to_datetime('20150409')]) #defaults to axis 0 (rows)
results

Unnamed: 0,A,B,C,D
2015-04-01,41.0,43.0,81.0,40.0
2015-04-02,54.0,81.0,71.0,72.0
2015-04-03,57.0,67.0,42.0,82.0
2015-04-04,39.0,44.0,35.0,77.0
2015-04-05,53.0,77.0,66.0,25.0
2015-04-06,0.834285,0.243944,-0.501019,1.085482
2015-04-07,-1.42412,-1.258914,0.994452,-0.878506
2015-04-08,-1.164783,0.095151,0.145796,-0.989919
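
Rows can also be removed by value instead of by label, as hinted by the `df[df.column_name != 'Tina']` comment earlier. A sketch on a hypothetical frame; apart from 'Tina', the names and scores are made up.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Tina', 'Jason', 'Tina', 'Amy'],
                       'score': [25, 94, 57, 62]})

# Keep only the rows where the condition is True.
without_tina = people[people['name'] != 'Tina']

# isin handles several values at once; ~ negates the mask.
without_some = people[~people['name'].isin(['Tina', 'Amy'])]

# drop by index label, the same idea as the date-based drop above.
by_label = people.drop(index=[0, 2])
print(without_tina)
```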


**Source:**<br>
    *Creating data:* https://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html
    <br>
    *Range vs arange:* https://www.quora.com/What-is-the-difference-between-range-and-arange-in-Python#:~:text=The%20built%2Din%20range%20function,are%20stored%20in%20Numpy%20arrays.
    <br>
    *Merging:* https://pandas.pydata.org/docs/user_guide/merging.html
    <br>
    *Adding DataFrame:* https://stackoverflow.com/questions/12555323/adding-new-column-to-existing-dataframe-in-python-pandas
    <br>
    *Creating DF using series:* https://www.geeksforgeeks.org/creating-a-dataframe-from-pandas-series/