Readme:


We encourage you to explore more functionality in 'Python for Data Analysis, 3E' by Wes McKinney, Chapter 5: 'Getting Started with pandas'.</br>
Link: https://wesmckinney.com/book/pandas-basics

In [None]:
!pip install pandas

In [1]:
import numpy as np
import pandas as pd

<p>
Create a one-dimensional Series object with values 4, 7, -5, 3 and index "d", "b", "a", "c". </br>
Return its values, then return its index.
</p>


In [3]:
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(s)
print(s.array) # type: <class 'pandas.core.arrays.numpy_.PandasArray'>
print(s.index)

d    4
b    7
a   -5
c    3
dtype: int64
<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')


<p>
Return the values from the above Series that are greater than zero. </br>
</p>


In [10]:
print(s[s > 0])

d    4
b    7
c    3
dtype: int64


<p>
Multiply each element in the Series object by 3. </br>
</p>


In [12]:
s * 3

d    12
b    21
a   -15
c     9
dtype: int64

<p>
Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary.</br>
E.g. the 'in' operator checks for the presence of a dictionary key (here, an index label), not a value.</br>
Check whether the value -5 is in the Series, first by using the value itself, then by using its index label.

</p>


In [None]:
print(-5 in s) # False: 'in' checks the index labels (keys), and -5 is a value
print('a' in s) # True: 'a' is an index label

False
True
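
<p>
To test for a value rather than an index label, one option is to look at the underlying data. The cell below is a minimal sketch: it rebuilds the Series 's' from the first task so that it runs on its own, and the variable names are illustrative only.
</p>

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# 'in' looks at the index labels, just like dictionary keys
label_found = "a" in s
value_as_label = -5 in s

# To test the data itself, check the underlying values instead
value_found = -5 in s.values
also_found = bool(s.isin([-5]).any())

print(label_found, value_as_label, value_found, also_found)
```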


<p>
Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary. A Series can be converted back to a dictionary with its to_dict method.</br>
When you are only passing a dictionary, the index in the resulting Series will respect the order of the keys according to the dictionary's keys method, which depends on the key insertion order.  </br> </br>

Create a Series out of 'sdata' and return the result.</br>
Then use 'states' as index and return the result.</br>
</p>


In [4]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj = pd.Series(sdata)
print(obj)

states = ["California", "Ohio", "Oregon", "Texas"]
obj = pd.Series(sdata, index=states)

print(obj)


Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
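
<p>
The round trip mentioned above (dictionary to Series and back with to_dict) can be sketched as follows. The name 'round_trip' is illustrative; the data is the same 'sdata' dictionary.
</p>

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj = pd.Series(sdata)

# The index follows the insertion order of the dictionary keys
print(list(obj.index))

# to_dict converts the Series back into a plain dictionary
round_trip = obj.to_dict()
print(round_trip == sdata)
```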


<p>
Return a boolean Series showing True where there is a missing value in the above Series. </br>
</p>


In [6]:
obj.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

<p>
Now return a boolean Series showing False where there is a missing value in the above Series. </br>
</p>


In [22]:
obj.notna()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
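
<p>
A boolean mask such as the one returned by notna is often used to filter out missing entries. The sketch below rebuilds 'obj' so that it runs on its own and compares the mask-based selection with dropna; 'present' and 'dropped' are illustrative names.
</p>

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj = pd.Series(sdata, index=states)

# Keep only the entries where the mask is True (i.e. not missing)
present = obj[obj.notna()]

# dropna is a shortcut for the same selection
dropped = obj.dropna()

print(present)
print(present.equals(dropped))
```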

<p>
1. Create a dataframe out of the dictionary below. Before you do this, think: will the dictionary keys become the index or the column names?</br>
2. Add a new column named 'dept' to the dataframe, with the values 0 through 5 (a range of 6). </br>
3. Add a new boolean column named 'eastern' that is True where 'state' is 'Ohio'.
</p>


In [7]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame['dept'] = np.arange(6)
frame['eastern'] = frame.state == 'Ohio'
frame

    state  year  pop  dept  eastern
0    Ohio  2000  1.5     0     True
1    Ohio  2001  1.7     1     True
2    Ohio  2002  3.6     2     True
3  Nevada  2001  2.4     3    False
4  Nevada  2002  2.9     4    False
5  Nevada  2003  3.2     5    False


<p>
Use the nested dictionary below to create a dataframe. </br>
Mind the outer dictionary keys and the inner dictionary keys: which ones become the index and which ones become the column names?
</p>


In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}
df = pd.DataFrame(populations) # outer: column names, inner: index
df


      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


<p>
Now swap the columns and the rows of the above dataframe. </br>
</p>


In [30]:
df.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9


<p>
Convert the dataframe 'df' to a NumPy array (note that df.T returned a new object and did not modify 'df'). </br>
</p>


In [39]:
df.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

<p>
1. Create a series out of values 4.5, 7.2, -5.3, 3.6 with index "d", "b", "a", "c". </br>
2. Try to mutate the index in place by assigning a new value to one of its labels, e.g. 'a'. What kind of error do you receive? Analyze the result.</br>
3. Try to change the index to "a", "b", "c", "d" using direct assignment. Does it work? It should, because you do NOT mutate the existing index object but create a new one (you may run id(s.index) before and after to prove it).</br>
4. Then try to directly assign the values "a", "b", "c", "d", "e" (different length!) to the index. Do you receive an error? </br>
5. Repeat the reassignment from step 4 using the 'reindex' method and return the result. Observe the missing values in it.</br>
</br>

</p>

In [102]:
s = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
# s.index[0] = 'x'  # TypeError: Index does not support mutable operations
s.index = ["a", "b", "c", "d"]  # ok: this does not mutate the old Index; it creates a new object (compare id(s.index) before and after)
# s.index = ["a", "b", "c", "d", "e"]  # ValueError: Length mismatch: Expected axis has 4 elements, new values have 5 elements
s = s.reindex(["a", "b", "c", "d", "e"])  # ok: the new label "e" gets NaN
s
# Think about the difference between reindex and set_index.




a    4.5
b    7.2
c   -5.3
d    3.6
e    NaN
dtype: float64
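
<p>
The difference between assigning to the index and calling reindex can be seen on the original Series. This is a minimal sketch with illustrative names ('relabeled', 'realigned'). Unlike the cell above, reindex is applied here directly to the original labels.
</p>

```python
import pandas as pd

s = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Assignment relabels by position: the values stay where they are
relabeled = s.copy()
relabeled.index = ["a", "b", "c", "d"]
print(relabeled["a"])  # the first value, now labelled "a"

# reindex looks values up by their existing labels and rearranges them;
# labels that do not exist yet (here "e") are filled with NaN
realigned = s.reindex(["a", "b", "c", "d", "e"])
print(realigned["a"])  # the value that was already labelled "a"
print(realigned.isna().sum())
```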

<p>
1. Create a series with values "blue", "purple", "yellow" and index 0, 2, 4. </br> 
2. Change the index to the range of 6 and observe the missing values.</br> 
3. Repeat step 2 using 'ffill' method to fill the missing values. Analyse the result. </br>

</p>


In [5]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0,2,4])
# obj3 = obj3.reindex(np.arange(6))  # step 2: labels 1, 3 and 5 would be NaN
obj3 = obj3.reindex(np.arange(6), method='ffill')
print(obj3)


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


<p>
1. Create a dataframe from a range of 9 values reshaped to (3, 3), with index "a", "c", "d" and columns "Ohio", "Texas", "California". </br>
2. Then reindex the columns to "Texas", "Utah", "California" and observe the missing values. </br>
 

</p>


In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
states = ["Texas", "Utah", "California"]
frame2 = frame.reindex(columns=states) # or frame.reindex(states, axis='columns') 
frame2

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


<p>
1. Create a dataframe out of the range of 16 with shape(4, 4) with index "Ohio", "Colorado", "Utah", "New York" and columns "one", "two", "three", "four" </br>
2. Then return a new dataframe by dropping index 'Ohio' and columns 'one', 'two'.</br>
                
</p>


In [88]:
df = pd.DataFrame(np.arange(16).reshape(4,4), 
                  index = ["Ohio", "Colorado", "Utah", "New York"],
                  columns = ["one", "two", "three", "four"])
df2 = df.drop('Ohio')
df2 = df2.drop(['one', 'two'], axis=1) # or df.drop(columns=['one','two'])
df2

          three  four
Colorado      6     7
Utah         10    11
New York     14    15


<p>
1. Given the dataframe below, return the rows where the value in column 'three' is greater than 5. </br>
2. Then assign zero to all values that are less than 5.
</p>


In [None]:
df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])
print(df[df['three'] > 5])
df[df < 5] = 0
print(df)

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


<p>
Using the dataframe from the previous task, return columns "one", "two", "three" only where values are greater than 5.</br>
First, use the 'loc' operator, then use the 'iloc' operator. </br>
</p>


In [124]:
print(df.loc[:, :'three'][df > 5])
print(df.iloc[:, :3][df > 5])
#print(df[['one', 'two', 'three']][df > 5]) # does the same

           one   two  three
Ohio       NaN   NaN    NaN
Colorado   NaN   NaN    6.0
Utah       8.0   9.0   10.0
New York  12.0  13.0   14.0
           one   two  three
Ohio       NaN   NaN    NaN
Colorado   NaN   NaN    6.0
Utah       8.0   9.0   10.0
New York  12.0  13.0   14.0


<p>
Now using the same dataframe, return the value located at row 'Utah' and column 'two' using 'at' or 'iat' operators. </br>
What is the difference between 'loc' and 'at' operators?
</p>


In [None]:
print(df.at['Utah', 'two']) # df.at accesses exactly one value at a time; df.loc can also select multiple rows and/or columns
print(df.iat[2, 1])

print(df.loc['Utah', 'two']) # same scalar via loc

9
9
9
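
<p>
To make the difference between 'loc' and 'at' concrete, the sketch below rebuilds the 4 x 4 dataframe and selects a scalar, a row fragment and a block. The variable names are illustrative.
</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])

# at: exactly one row label and one column label, returns a scalar
single = df.at["Utah", "two"]

# loc: also accepts lists and label slices (label slices include both ends)
several = df.loc["Utah", ["one", "two"]]
block = df.loc["Colorado":"Utah", "two":"three"]

print(single)
print(several.tolist())
print(block.shape)
```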


<p>
What will happen if you add the dataframes below using the '+' operator? </br>
Return the result. </br>
</p>


In [125]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})
df1 + df2

   A
0  4
1  6


<p>
What will happen if you add the dataframes below (different column names) using the '+' operator?  </br>
Return the result. </br>
</p>


In [126]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
df1 + df2

    A   B
0 NaN NaN
1 NaN NaN


<p>
Now use the dataframes from the prior task and use 'add' method with argument 'fill_value=0' which substitutes the passed value for any missing values in the operation. </br>
</p>


In [135]:
print(df1.add(df2, fill_value=0)) # a value missing on one side is treated as 0 before adding, e.g. 1 + 0 = 1

     A    B
0  1.0  3.0
1  2.0  4.0
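
<p>
Note that fill_value only helps when one of the two operands has a value. If an entry is missing on both sides, the result stays missing. The sketch below uses small made-up frames (not the ones from the task) to show this.
</p>

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 of column "A" is missing in both frames
df1 = pd.DataFrame({"A": [1.0, np.nan]})
df2 = pd.DataFrame({"A": [np.nan, np.nan], "B": [3.0, 4.0]})

result = df1.add(df2, fill_value=0)
print(result)
```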


<p>
Now for each value in df1 return the result of division operation '1/value':
 </br>
</p>


In [136]:
print(1/df1) # or df1.rdiv(1)

     A
0  1.0
1  0.5


<p>
1. Create a dataframe out of random float numbers using np.random.standard_normal in shape (4, 3) with columns list("bde") and index "Utah", "Ohio", "Texas", "Oregon".</br>
2. Define a custom function that returns the difference between the maximum and minimum values of a given sequence.</br>
3. Apply the function to the dataframe. What is the default axis for 'apply'? Try to change the axis and analyze the result.</br>
4. Now repeat the process using a lambda function. </br>
</br>
Note: with axis=0 (the default) the function receives each column, i.e. it runs down the rows (top to bottom); with axis=1 it receives each row, i.e. it runs across the columns (left to right).

</p>


In [147]:
df = pd.DataFrame(np.random.standard_normal((4,3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])
print(df)

def f1(x):
    return x.max() - x.min()

print(df.apply(f1))
print(df.apply(f1, axis=1))

print(df.apply(lambda x: x.max() - x.min()))


               b         d         e
Utah    0.976660 -0.454376 -0.413014
Ohio   -0.253685  0.516224 -0.862965
Texas   0.868212  1.400188  0.836135
Oregon  0.883565 -1.087859 -0.636401
b    1.230345
d    2.488046
e    1.699100
dtype: float64
Utah      1.431035
Ohio      1.379189
Texas     0.564053
Oregon    1.971423
dtype: float64
b    1.230345
d    2.488046
e    1.699100
dtype: float64
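
<p>
Because the data above is random, the effect of the axis argument is easier to check on fixed numbers. The sketch below uses a small made-up dataframe; 'value_range' is an illustrative name for the same max-minus-min function.
</p>

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 4], "d": [2, 8]}, index=["Utah", "Ohio"])

def value_range(x):
    return x.max() - x.min()

# axis=0 (default): the function receives each column, one result per column
per_column = df.apply(value_range)

# axis=1: the function receives each row, one result per row
per_row = df.apply(value_range, axis=1)

print(per_column)
print(per_row)
```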


<p>
Create a function that accepts an input float number like '3.475911' and returns a string formatted like '3.476'.</br>
Then apply it to each element of the dataframe created in the previous task. </br>
</p>


In [148]:
def f2(x):
    return f'{x:.3f}' # no space in the format spec, so positive numbers get no leading blank
df.applymap(f2)

             b       d       e
Utah     0.977  -0.454  -0.413
Ohio    -0.254   0.516  -0.863
Texas    0.868   1.400   0.836
Oregon   0.884  -1.088  -0.636
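
<p>
In pandas 2.1 and later, 'applymap' is deprecated in favour of 'DataFrame.map', which does the same element-wise work. The sketch below picks whichever method the installed version provides. The data and the names 'fmt' and 'elementwise' are illustrative.
</p>

```python
import pandas as pd

df = pd.DataFrame({"b": [0.97666, -0.253685]}, index=["Utah", "Ohio"])

def fmt(x):
    return f"{x:.3f}"

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(fmt)
print(formatted)
```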


<p>
1. Create a dataframe out of the range of 8 with shape(2, 4) with index "three", "one" and columns "d", "a", "b", "c".</br>
2. Sort it by its index in descending order.</br>
3. Then sort it by the column index.</br>
4. Then sort it by the values in columns 'a' and 'b'.</br>

</p>


In [152]:
df = pd.DataFrame(np.arange(8).reshape(2,4), index=["three", "one"], columns=["d", "a", "b", "c"])
print(df.sort_index(ascending=False))
print(df.sort_index(axis=1))
print(df.sort_values(['a', 'b']))

       d  a  b  c
three  0  1  2  3
one    4  5  6  7
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
       d  a  b  c
three  0  1  2  3
one    4  5  6  7


<p>
1. Create a Series out of the range of 5 with index "a", "a", "b", "b", "c". </br>
2. Return a boolean value that indicates if the index values are unique.</br>
3. Return unique indexes only.</br>
4. Return unique data values only.</br>
</p>


In [None]:
s = pd.Series(np.arange(5), index = ["a", "a", "b", "b", "c"])
print(s.index.is_unique)
print(s.index.unique()) # you can do: s.index.unique()[0] - to return scalar values
print(s.unique())


False
Index(['a', 'b', 'c'], dtype='object')
[0 1 2 3 4]


<p>
1. Guess what data type is returned when you select data by a non-unique index label, and what is returned for a unique one.</br>
2. Use the series from the previous task to test it and analyze the result.</br>

Hint: 
This can make your code more complicated, as the output type from indexing can vary based on whether or not a label is repeated! </br>
Same logic extends to a dataframe: check the next task.
</p>


In [None]:
print(s['a']) # Series for non-unique
print(s['c']) # scalar for unique

a    0
a    1
dtype: int32
4
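
<p>
If the varying return type is a problem, one possible workaround is to pass the label inside a list, which returns a Series whether or not the label is repeated. This is a sketch with illustrative names, not the only way to handle it.
</p>

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# A single label: Series for a repeated label, scalar for a unique one
print(type(s["a"]).__name__)
print(type(s["c"]).__name__)

# A list of labels: always a Series, even for the unique label
always_series_a = s.loc[["a"]]
always_series_c = s.loc[["c"]]
print(len(always_series_a), len(always_series_c))
```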


<p>
1. Create a dataframe out of random float numbers using np.random.standard_normal in shape (5, 3) with index "a", "a", "b", "b", "c". </br>
2. Use the 'loc' operator to return the values for index 'a'. What data type does it return?</br>
3. Now use the same operator to return the values for index 'c'. Does it return the same data type? Why?</br>
4. Finally, use the same operator to return the value located at row 'c', column 0. What data type does it return this time?
</p>


In [None]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                   index=["a", "a", "b", "b", "c"])
print(df.loc['a']) # DataFrame: the label 'a' is repeated, so several rows come back
print(df.loc['c']) # Series: the label 'c' is unique, so the single row comes back as a Series
print(df.loc['c', 0]) # scalar: one row label and one column label


          0         1         2
a  0.578379 -0.375141  1.122419
a -1.689073  0.080801 -2.104061
0   -0.523181
1   -1.489312
2    0.218787
Name: c, dtype: float64
-0.5231809940727353


