References:
- programiz
- geeksforgeeks
- python.org

# The pythonic way of coding.

#### List comprehensions

- Often a more concise alternative to ``for`` loops that build a list.
- Usually easier to write and read when the logic is simple.
- (Slightly) faster than an equivalent ``for`` loop that calls ``append``.

In [1]:
# Create a list of the even numbers from 0 up to (but not including) 20

In [2]:
even_numbers = []
for i in range(20):
    if i % 2 == 0:
        even_numbers.append(i)
print(even_numbers)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


In [3]:
number_list = [ x for x in range(20) if x % 2 == 0]
print(number_list)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


In [4]:
# With strings

In [5]:
h_letters = []
s="data science"

for letter in s:
    h_letters.append(letter)

print(h_letters)

['d', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e']


In [71]:
[l for l in s]

['d', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e']

In [6]:
# Using conditionals

In [7]:
obj = ["Even" if i%2==0 else "Odd" for i in range(10)]
print(obj)

['Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd']
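
The two comprehensions above use ``if`` in different positions, which is easy to mix up. The sketch below (variable names are illustrative, not from the original notebook) contrasts a trailing ``if`` that filters items with an ``if``/``else`` expression that transforms every item, and then combines both.

```python
numbers = range(10)

# Trailing `if` FILTERS: only items passing the test are kept.
evens_only = [n for n in numbers if n % 2 == 0]
print(evens_only)  # [0, 2, 4, 6, 8]

# `if`/`else` before `for` TRANSFORMS: every item is kept, but mapped to a value.
labels = ["Even" if n % 2 == 0 else "Odd" for n in numbers]
print(labels[:4])  # ['Even', 'Odd', 'Even', 'Odd']

# Both can be combined: filter the even numbers, then label each one by size.
sized_evens = [f"{n}:{'big' if n > 4 else 'small'}" for n in numbers if n % 2 == 0]
print(sized_evens)  # ['0:small', '2:small', '4:small', '6:big', '8:big']
```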


#### Args and kwargs

- The special parameter ``*args`` in a Python function definition collects a variable number of positional arguments into a tuple.

- With ``*args``, any number of extra arguments (including zero) can follow the function's regular formal parameters.

- For example, suppose we want a multiply function that accepts any number of arguments and multiplies them all together. This can be done using ``*args``.

In [8]:
# argv is a tuple
def myFun(*argv):  
    prdt=1
    for arg in argv:  
        prdt*=arg
    return prdt
    
myFun(1,2,3,4,5,6)  

720

In [9]:
# Python program to illustrate  
# *args with first extra argument 

def myFun(arg1, *argv): 
    print ("First argument :", arg1) 
    for arg in argv: 
        print("Next argument through *argv :", arg) 
  
myFun('Hello', 'Welcome', 'to', 'DSC80') 

First argument : Hello
Next argument through *argv : Welcome
Next argument through *argv : to
Next argument through *argv : DSC80


- In the same way, ``**kwargs`` is used to pass a variable number of keyword arguments.
- Unlike the arguments passed through ``*args``, each of these arguments has an identifier (its keyword), and they are collected into a dictionary.

In [52]:
def myFun(**kwargs):  
    for key, value in kwargs.items(): 
        print ("%s == %s" %(key, value)) 
  
myFun(first ='Geeks', mid ='for', last='Geeks')   

first == Geeks
mid == for
last == Geeks
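
Both forms can appear in the same signature. The following is a minimal sketch, assuming a hypothetical helper named ``describe_call`` (not part of the original notebook), that shows where each argument ends up and how ``*`` and ``**`` can also unpack values at the call site.

```python
def describe_call(first, *args, **kwargs):
    """Hypothetical helper: report how the arguments were bound."""
    return {"first": first, "args": args, "kwargs": kwargs}

summary = describe_call("Hello", "Welcome", "to", course="DSC80")
print(summary["first"])   # Hello
print(summary["args"])    # ('Welcome', 'to')
print(summary["kwargs"])  # {'course': 'DSC80'}

# The same operators unpack a tuple or a dict when CALLING a function.
numbers = (1, 2, 3)
options = {"sep": "-"}
print(*numbers, **options)  # 1-2-3
```

Positional arguments must come before ``*args``, and ``**kwargs`` must come last in the signature.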


### Truth value for conditional testing

Any object can be tested for truth value, for use in a conditional or as an operand of the Boolean operations (``and``, ``or``, ``not``). The following values are considered false:

- None
- False
- zero of any numeric type, for example, 0, 0.0, 0j.
- any empty sequence, for example, '', (), [].
- any empty mapping, for example, {}.

In [11]:
if False:
    print("This won't execute")
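
As a quick check of the list above, the sketch below tests several values directly. The helper ``first_or_default`` is a hypothetical name used only to show the common idiom of testing a container for emptiness.

```python
# Every value here is treated as false in a conditional.
falsy_values = [None, False, 0, 0.0, 0j, "", (), [], {}, set()]
print(all(not value for value in falsy_values))  # True

# Non-empty containers and non-zero numbers are true, even if their contents are "empty".
truthy_values = [1, -1, "0", " ", [0], (None,), {"k": None}]
print(all(bool(value) for value in truthy_values))  # True

def first_or_default(items, default=None):
    """Hypothetical helper: `if items` is the idiomatic emptiness test."""
    return items[0] if items else default

print(first_or_default([]))          # None
print(first_or_default(["a", "b"]))  # a
```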

#### Unpacking tuple values 
- Tuples behave much like lists, but are immutable.
- Why use tuples?
- How do you change "a" to "b" in ("a","c","d")?

In [12]:
# Program illustrating packing and unpacking in Python
# This line PACKS values into the variable a
a = ("DSC", 80, "UCSD")

# This line UNPACKS the values of the variable a
(Course, code, Uni) = a
  
print(Course) 
print(code) 
print(Uni) 

DSC
80
UCSD
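
One possible answer to the two questions above, as a minimal sketch: a tuple cannot be modified in place, so "changing" an element means building a new tuple, and because tuples are immutable they can serve as dictionary keys.

```python
letters = ("a", "c", "d")

# Item assignment on a tuple raises TypeError.
try:
    letters[0] = "b"
except TypeError as error:
    print("Tuples cannot be modified in place:", error)

# Build a new tuple from the replacement plus a slice of the old one.
updated = ("b",) + letters[1:]
print(updated)  # ('b', 'c', 'd')
print(letters)  # ('a', 'c', 'd')

# Immutability makes tuples hashable, so they can be used as dict keys.
locations = {("DSC", 80): "UCSD"}
print(locations[("DSC", 80)])  # UCSD
```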


#### Shallow vs deep copy in Pandas

In [13]:
import pandas as pd
import numpy as np

In [14]:
A=pd.DataFrame(np.random.random((5,2)),columns=["col1","col2"])
A.head()

Unnamed: 0,col1,col2
0,0.411008,0.976684
1,0.465557,0.451788
2,0.666848,0.814625
3,0.490985,0.176665
4,0.931537,0.342066


In [15]:
B=A
B.head()

Unnamed: 0,col1,col2
0,0.411008,0.976684
1,0.465557,0.451788
2,0.666848,0.814625
3,0.490985,0.176665
4,0.931537,0.342066


In [16]:
B.drop(["col1"],axis="columns",inplace=True)

In [17]:
B.head()

Unnamed: 0,col2
0,0.976684
1,0.451788
2,0.814625
3,0.176665
4,0.342066


In [18]:
A.head()

Unnamed: 0,col2
0,0.976684
1,0.451788
2,0.814625
3,0.176665
4,0.342066
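
The cells above show that ``B=A`` does not copy anything: both names refer to the same DataFrame, so dropping a column through ``B`` also changes ``A``. A minimal sketch of the alternative, using ``DataFrame.copy()`` (which makes a deep copy by default), is shown below. The name ``C`` is illustrative only.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.random((5, 2)), columns=["col1", "col2"])

# Plain assignment binds a second name to the SAME object.
B = A
print(B is A)  # True

# copy() creates an independent DataFrame (deep=True is the default).
C = A.copy()
C.drop(columns=["col1"], inplace=True)

print(list(C.columns))  # ['col2']
print(list(A.columns))  # ['col1', 'col2']
```

``A.copy(deep=False)`` creates a new DataFrame object that shares the underlying data with the original. How writes propagate between the two depends on the Pandas version, so it is worth checking the documentation for the version in use.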


### Context managers - ``with``

- Context managers allow us to allocate and release resources. 

In [19]:
file=open("SacramentocrimeJanuary2006.csv","w")

- The file handle stays open, and the file may remain locked for other programs, until it is explicitly closed!

In [20]:
file.close()

- Ideally, a file open would be handled in the following way:

In [21]:
file = open('SacramentocrimeJanuary2006.csv', 'a')
try:
    file.write('abcde')
finally:
    print("Closing file")
    file.close()

Closing file


This is equivalent to:

In [22]:
with open("SacramentocrimeJanuary2006.csv", 'a') as opened_file:
    opened_file.write('abcde')
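
To illustrate why ``with`` is preferred, the sketch below writes to a temporary file (the path is created with ``tempfile`` purely for illustration, so no real data file is touched) and simulates a failure inside the block. The file is still closed on the way out.

```python
import os
import tempfile

# Illustrative scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

try:
    with open(path, "w") as opened_file:
        opened_file.write("abcde")
        raise ValueError("simulated failure")  # leaves the block early
except ValueError:
    pass

# The context manager closed the file even though an exception was raised.
print(opened_file.closed)  # True

with open(path) as reopened:
    content = reopened.read()
print(content)  # abcde
```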

### Iterators
- An iterator is an object that is used to get the "next" item from an iterable.
- Iterators are not indexable!

In [23]:
mytuple = ("apple", "banana", "cherry")
myit = iter(mytuple)

print(next(myit))
print(next(myit))
print(next(myit))


apple
banana
cherry


In [24]:
for i in mytuple:
    print(i)

apple
banana
cherry


- What is really happening in the for loop?

In [25]:
# create an iterator object from that iterable
iter_obj = iter(mytuple)

# infinite loop
while True:
    try:
        # get the next item
        element = next(iter_obj)
        print(element)
        # do something with element
    except StopIteration:
        # if StopIteration is raised, break from loop
        break

apple
banana
cherry
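
The same protocol can be implemented by hand. The following is a minimal sketch, assuming a hypothetical class ``Countdown`` (not from the original notebook), that defines ``__iter__`` and ``__next__`` and raises ``StopIteration`` when it is exhausted. It also shows that an iterator cannot be indexed.

```python
class Countdown:
    """Hypothetical iterator that yields start, start-1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals the for loop to stop
        value = self.current
        self.current -= 1
        return value

print(list(Countdown(3)))  # [3, 2, 1]

# Iterators only support next(); indexing raises TypeError.
fruit_iter = iter(("apple", "banana", "cherry"))
try:
    fruit_iter[0]
except TypeError:
    print("iterators are not indexable")
print(next(fruit_iter))  # apple
```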


### More unpacking

In [26]:
a=10; b=20

Write a program (WAP) to swap a and b.
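
One common idiom for this exercise relies on tuple packing and unpacking, so no temporary variable is needed. This is a sketch of one possible answer, not the only one.

```python
a = 10
b = 20

# The right-hand side is packed into a tuple (b, a) first,
# then unpacked into the names on the left.
a, b = b, a

print(a, b)  # 20 10
```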

### Create a length-N list of the same thing

In [27]:
# Create a list with 100 1s. 
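
One possible approach is sequence repetition with ``*``. The sketch below also points out a pitfall when the repeated item is mutable; the variable names are illustrative.

```python
# Repeat a one-element list 100 times.
ones = [1] * 100
print(len(ones), sum(ones))  # 100 100

# Caution: repeating a list that holds a mutable object repeats the REFERENCE.
shared_rows = [[0]] * 3
shared_rows[0].append(1)
print(shared_rows)  # [[0, 1], [0, 1], [0, 1]]

# A comprehension creates a fresh object for each position instead.
separate_rows = [[0] for _ in range(3)]
separate_rows[0].append(1)
print(separate_rows)  # [[0, 1], [0], [0]]
```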

### Searching for an item in a collection

In [29]:
if "s" in "dsc":
    print("s is in dsc")

s is in dsc


### Improve (Pythonize) this!

In [30]:
d = {'hello': 'world'}
if ('hello' in d):
    print (d['hello'])    # prints 'world'
else:
    print('default_value')

world
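
One possible way to Pythonize the cell above is ``dict.get``, which returns a default when the key is missing and so replaces the explicit ``if``/``else``. The names ``found`` and ``fallback`` are illustrative.

```python
d = {'hello': 'world'}

# get(key, default) looks the key up once and falls back to the default.
found = d.get('hello', 'default_value')
fallback = d.get('missing', 'default_value')

print(found)     # world
print(fallback)  # default_value
```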


* more here: https://docs.python-guide.org/writing/style/
* and here: https://gist.github.com/dpallot/1aadff223f3b3efbec8e

## Relationship between Pandas and NumPy

### Pandas requires NumPy in order to run

* The underlying data structure of a Pandas DataFrame is built on NumPy arrays
* NumPy is optimized for performing numerical computation on a collection, e.g. mean, median, etc.
* Pandas provides another level of abstraction that helps users interact with tabular data and analyze it, e.g. join, filter, engineer features, etc.

The key takeaways: 
- Some tasks can be done using either Pandas or NumPy.
- Pandas is not always slower than NumPy. It depends on the task, the size of the data and the environment in which the code is run.
- However, when you need to decide between them:
    - On your development machine: choose the one that is simpler to understand and easier to implement
    - In production: choose the one that performs better

In [61]:
import time
df = pd.read_csv("data.csv")
df.head(10)

Unnamed: 0,cdatetime,address,district,beat,grid,crimedescr,ucr_ncic_code,latitude,longitude
0,1/1/06 0:00,3108 OCCIDENTAL DR,3,3C,1115,10851(A)VC TAKE VEH W/O OWNER,2404,38.55042,-121.3914158
1,1/1/06 0:00,2082 EXPEDITION WAY,5,5A,1512,459 PC BURGLARY RESIDENCE,2204,38.473501,-121.4901858
2,1/1/06 0:00,4 PALEN CT,2,2A,212,10851(A)VC TAKE VEH W/O OWNER,2404,38.657846,-121.4621009
3,1/1/06 0:00,22 BECKFORD CT,6,6C,1443,476 PC PASS FICTICIOUS CHECK,2501,38.506774,-121.4269508
4,1/1/06 0:00,3421 AUBURN BLVD,2,2A,508,459 PC BURGLARY-UNSPECIFIED,2299,38.637448,-121.3846125
5,1/1/06 0:00,5301 BONNIEMAE WAY,6,6B,1084,530.5 PC USE PERSONAL ID INFO,2604,38.526979,-121.4513383
6,1/1/06 0:00,2217 16TH AVE,4,4A,957,459 PC BURGLARY VEHICLE,2299,38.537173,-121.4875774
7,1/1/06 0:00,3547 P ST,3,3C,853,484 PC PETTY THEFT/INSIDE,2308,38.564335,-121.4618826
8,1/1/06 0:00,3421 AUBURN BLVD,2,2A,508,459 PC BURGLARY BUSINESS,2203,38.637448,-121.3846125
9,1/1/06 0:00,1326 HELMSMAN WAY,1,1B,444,1708 US THEFT OF MAIL,2310,38.609602,-121.4918375


### Use library functions to optimize time and space

In [32]:
#sum data in dataframe
import time
t0 = time.time()
df['grid'].sum()
print(time.time() - t0)

0.0007936954498291016


In [33]:
t0 = time.time()
result = 0
for v in df['grid'].values:
    result += v
print(time.time() - t0)

0.0022170543670654297
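
A single ``time.time()`` measurement can be noisy. The sketch below repeats the same comparison with ``timeit`` on synthetic data (a Series of 10,000 integers, used here as a stand-in for the ``grid`` column, which is an assumption). Exact timings vary by machine, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric column.
values = pd.Series(np.arange(10_000))

def loop_sum(series):
    """Sum the values one at a time in a Python loop."""
    total = 0
    for v in series.values:
        total += v
    return total

# Run each approach 50 times and report the total elapsed time.
vectorized_time = timeit.timeit(lambda: values.sum(), number=50)
loop_time = timeit.timeit(lambda: loop_sum(values), number=50)

print(f"Series.sum(): {vectorized_time:.4f}s")
print(f"Python loop:  {loop_time:.4f}s")
```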


In [36]:
#space usage
df[['grid']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7584 entries, 0 to 7583
Data columns (total 1 columns):
grid    7584 non-null int64
dtypes: int64(1)
memory usage: 59.3 KB


In [37]:
import sys
sys.getsizeof(list(df[['grid']].values))/1000

68.368

### Types in Pandas and NumPy

|Pandas dtype|Python type|NumPy type|Usage|
|---|---|---|---|
|object|NA|object|Mixed types|
|object|str|string, unicode|Text|
|int64|int|int_, int8,...,int64, uint8,...,uint64|Integer numbers|
|float64|float|float_, float16, float32, float64|Floating point numbers|
|bool|bool|bool_|True/False values|
|datetime64|NA|datetime64[ns]|Date and time values|
|timedelta[ns]|NA|NA|Differences between two datetimes|
|category|NA|NA|Finite list of text values|

In [38]:
df = df[:100].copy()  # copy so later column assignments do not act on a slice of the original
df.dtypes

cdatetime         object
address           object
district           int64
beat              object
grid               int64
crimedescr        object
ucr_ncic_code      int64
latitude         float64
longitude         object
dtype: object

### String 

Strings in NumPy have a fixed size (C-like), while Pandas uses native Python strings, which are stored as objects.

In [39]:
df['crimedescr'].values[:16] # the dtype here is object (even though it's a NumPy array)

array(['10851(A)VC TAKE VEH W/O OWNER', '459 PC  BURGLARY RESIDENCE',
       '10851(A)VC TAKE VEH W/O OWNER', '476 PC PASS FICTICIOUS CHECK',
       '459 PC  BURGLARY-UNSPECIFIED', '530.5 PC USE PERSONAL ID INFO',
       '459 PC  BURGLARY VEHICLE', '484 PC   PETTY THEFT/INSIDE',
       '459 PC  BURGLARY BUSINESS', '1708 US   THEFT OF MAIL',
       'ASSAULT WITH WEAPON - I RPT', '530.5 PC USE PERSONAL ID INFO',
       'SUSP PERS-NO CRIME - I RPT', '530.5 PC USE PERSONAL ID INFO',
       '484G(B) PC ACCESS CARD FRAUD', '487(A) PC GRAND THEFT'],
      dtype=object)

In [64]:
# If we were to create the same array in NumPy with its native fixed-width string type
# U stands for 'Unicode string'; U10 keeps at most 10 characters per element
np.array(['10851(A)VC TAKE VEH W/O OWNER', '459 PC  BURGLARY RESIDENCE',
       '10851(A)VC TAKE VEH W/O OWNER', '476 PC PASS FICTICIOUS CHECK',
       '459 PC  BURGLARY-UNSPECIFIED', '530.5 PC USE PERSONAL ID INFO',
       '459 PC  BURGLARY VEHICLE', '484 PC   PETTY THEFT/INSIDE',
       '459 PC  BURGLARY BUSINESS', '1708 US   THEFT OF MAIL'], dtype='|U10')

array(['10851(A)VC', '459 PC  BU', '10851(A)VC', '476 PC PAS',
       '459 PC  BU', '530.5 PC U', '459 PC  BU', '484 PC   P',
       '459 PC  BU', '1708 US   '], dtype='<U10')

In [69]:
# Pandas allows fixed-length strings too; it just doesn't happen automatically
df['crimedescr'].astype('|S10').head(5)

0    b'10851(A)VC'
1    b'459 PC  BU'
2    b'10851(A)VC'
3    b'476 PC PAS'
4    b'459 PC  BU'
Name: crimedescr, dtype: bytes80

In [70]:
# astype('|U10') converts back to object in Pandas, since NumPy's Unicode type is stored as object in Pandas
df['crimedescr'].astype('|U10').head(5)

0    10851(A)VC TAKE VEH W/O OWNER
1       459 PC  BURGLARY RESIDENCE
2    10851(A)VC TAKE VEH W/O OWNER
3     476 PC PASS FICTICIOUS CHECK
4     459 PC  BURGLARY-UNSPECIFIED
Name: crimedescr, dtype: object

### Manipulating the types

In [42]:
df.dtypes

cdatetime         object
address           object
district           int64
beat              object
grid               int64
crimedescr        object
ucr_ncic_code      int64
latitude         float64
longitude         object
dtype: object

In [43]:
#using function astype() to convert a column to a specific type
df['district'] = df['district'].astype('category')

In [44]:
df.district.cat.categories

Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

In [45]:
# Rename the categories and the values in the column are replaced accordingly, e.g. 1->A, 2->B
df.district = df.district.cat.rename_categories(['A', 'B', 'C', 'D', 'E', 'F'])

In [56]:
df.head(4)

Unnamed: 0,cdatetime,address,district,beat,grid,crimedescr,ucr_ncic_code,latitude,longitude
0,2006-01-01,3108 OCCIDENTAL DR,C,3C,1115,10851(A)VC TAKE VEH W/O OWNER,2404,38.55042,-121.3914158
1,2006-01-01,2082 EXPEDITION WAY,E,5A,1512,459 PC BURGLARY RESIDENCE,2204,38.473501,-121.4901858
2,2006-01-01,4 PALEN CT,B,2A,212,10851(A)VC TAKE VEH W/O OWNER,2404,38.657846,-121.4621009
3,2006-01-01,22 BECKFORD CT,F,6C,1443,476 PC PASS FICTICIOUS CHECK,2501,38.506774,-121.4269508


### What's happening under the hood?
- Pandas stores the categories as a map, where the category names are the keys and each key is associated with an integer value
- This is useful because it is more space efficient (storing ints instead of string objects)

<img src ='https://www.dataquest.io/blog/content/images/categorical.svg' align='middle'>
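
The integer codes behind a categorical column can be inspected through the ``.cat`` accessor. The sketch below uses a small made-up Series of district numbers (not the crime data itself) to show that renaming the categories changes the labels but leaves the stored codes untouched.

```python
import pandas as pd

# Made-up district numbers, stored as a categorical column.
districts = pd.Series([3, 5, 2, 6, 2], dtype="category")

# The distinct categories, and the integer code stored for each row.
print(districts.cat.categories.tolist())  # [2, 3, 5, 6]
print(districts.cat.codes.tolist())       # [1, 2, 0, 3, 0]

# Renaming only swaps the labels; the per-row codes stay the same.
renamed = districts.cat.rename_categories(["B", "C", "E", "F"])
print(renamed.tolist())            # ['C', 'E', 'B', 'F', 'B']
print(renamed.cat.codes.tolist())  # [1, 2, 0, 3, 0]
```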

In lecture you looked at the astype() function. Alternatively, we can call conversion functions for specific types too.

In [46]:
# The timestamps are month/day/year (the data covers January 2006), so the month comes first
df['cdatetime'] = pd.to_datetime(df['cdatetime'], format='%m/%d/%y %H:%M')

In [47]:
df.dtypes

cdatetime        datetime64[ns]
address                  object
district               category
beat                     object
grid                      int64
crimedescr               object
ucr_ncic_code             int64
latitude                float64
longitude                object
dtype: object

Having columns with the right types allows you to select columns based on the types.

In [50]:
df.select_dtypes(exclude=['category']).head(2)

Unnamed: 0,cdatetime,address,beat,grid,crimedescr,ucr_ncic_code,latitude,longitude
0,2006-01-01,3108 OCCIDENTAL DR,3C,1115,10851(A)VC TAKE VEH W/O OWNER,2404,38.55042,-121.3914158
1,2006-01-01,2082 EXPEDITION WAY,5A,1512,459 PC BURGLARY RESIDENCE,2204,38.473501,-121.4901858


In [51]:
#select all numerics
df.select_dtypes(include=['int', 'float']).head(2)

Unnamed: 0,grid,ucr_ncic_code,latitude
0,1115,2404,38.55042
1,1512,2204,38.473501


## Dealing with Warning Messages

- A warning doesn't mean that your solution is wrong
- However, I would recommend trying to fix warnings that come from your own code because:
    - Warning messages clutter the output and reduce readability, which makes the code harder to debug and scale in the future
    - Sometimes they help you optimize your code or call libraries in a more appropriate way
- If a warning comes from function calls inside a library that you are using:
    - You can ignore it or try upgrading the library to a newer version
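
When a warning does come from code you cannot change, Python's standard ``warnings`` module can capture or silence it for a limited block instead of globally. The sketch below uses a hypothetical function ``legacy_mean`` as a stand-in for such a library call.

```python
import warnings

def legacy_mean(values):
    """Hypothetical stand-in for a library function that emits a warning."""
    warnings.warn("legacy_mean is deprecated", DeprecationWarning, stacklevel=2)
    return sum(values) / len(values)

# Capture warnings to inspect them instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = legacy_mean([1, 2, 3])

print(result)                       # 2.0
print(caught[0].category.__name__)  # DeprecationWarning

# Silence one category of warning, but only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    quiet_result = legacy_mean([4, 5, 6])

print(quiet_result)  # 5.0
```

Keeping the filter inside ``catch_warnings()`` restores the previous settings afterwards, so warnings from your own code remain visible elsewhere.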