### Writing efficient Python code

In this project, we explore how to write Python programs that are readable and at the same time efficient, with a fast runtime and minimal memory usage.

The hip_sp.csv file is a shortened version of the main Hipparcos catalog. It contains 118218 rows and 11 columns, selected from the 78 columns of the main catalog. The selected columns are:

<ul>    
<li> Hip_No -- unique Hipparcos number </li>
<li> Alpha in (h,m,s) and Delta in (d,m,s) -- right ascension and declination, the stellar coordinates </li>
<li> Vmag -- visual magnitude, a measure of the apparent stellar brightness </li>
<li> B-V and V-I -- color indices that indicate a star's color </li>
<li> Plx -- trigonometric parallax in milliarcseconds </li>
<li> e_Plx -- standard error in Plx in milliarcseconds </li>
<li> Var_period -- period (in days) for variable stars </li>
<li> Var_type -- type of variability </li>
<li> Spectral_type -- spectral type of an object, which reflects stellar temperature and color </li>
</ul>

Some additional columns are calculated from the given set of columns:
<ul>
<li> Mv -- absolute stellar magnitude, a measure of the intrinsic stellar brightness, calculated from the Hipparcos apparent visual magnitude (Vmag) and the Hipparcos measured parallax (Plx). </li>
<li> Distance -- distance in parsecs, calculated from the apparent magnitude Vmag and the absolute magnitude Mv </li>
</ul>
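
As a rough illustration of how the two derived columns relate, the sketch below applies the standard distance-modulus relations to a single made-up star. The function names and the sample values are assumptions chosen for illustration; they are not taken from the catalog.

```python
import math

def absolute_magnitude(vmag, plx_mas):
    """Absolute magnitude from apparent magnitude and parallax in milliarcseconds."""
    distance_pc = 1000.0 / plx_mas          # parallax in mas -> distance in parsecs
    return vmag + 5 - 5 * math.log10(distance_pc)

def distance_parsecs(vmag, mv):
    """Invert the distance modulus to recover the distance in parsecs."""
    return 10 ** ((vmag - mv + 5) * 0.2)

# Hypothetical star: Vmag = 7.0, parallax = 10 mas
mv = absolute_magnitude(7.0, 10.0)
d = distance_parsecs(7.0, mv)
print(mv, d)
```

A parallax of 10 milliarcseconds corresponds to 100 parsecs, so the second function should return the distance implied by the first.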

### Importing data 

In [1]:
%%time
import numpy as np
import pandas as pd

file = '../data/hip_sp.csv'

new_column_names = ['Hip_No', 'Alpha', 'Delta','Vmag', 'B-V', 'V-I', 'Plx', 'e_Plx', 'Var_period', 'Var_type','Spectral_type']
hip_sp1 = pd.read_csv(file, header = None, sep =',',
                usecols = [1,2,3,4,5,6,7,8,9,10,11],  
                names = new_column_names,
                low_memory = False)

hip_sp1.head(5)

Wall time: 703 ms


Unnamed: 0,Hip_No,Alpha,Delta,Vmag,B-V,V-I,Plx,e_Plx,Var_period,Var_type,Spectral_type
0,1,00 00 00.22,+01 05 20.4,9.1,3.54,1.39,0.482,0.55,,,F5
1,2,00 00 00.91,-19 29 55.8,9.27,21.9,3.1,0.999,1.04,,C,K3V
2,3,00 00 01.20,+38 51 33.4,6.61,2.81,0.63,-0.019,0.0,,C,B9
3,4,00 00 02.01,-51 53 36.8,8.06,7.75,0.97,0.37,0.43,,,F0V
4,5,00 00 02.39,-40 35 28.4,8.55,2.87,1.11,0.902,0.9,,,G8III


### Changing data types of the columns

In [2]:
col_list = ['Vmag', 'Plx', 'e_Plx', 'B-V', 'V-I', 'Var_period']

for  col in col_list:
    hip_sp1[col] = pd.to_numeric(hip_sp1[col],  errors = 'coerce')
    
hip_sp1.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118218 entries, 0 to 118217
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Hip_No         118218 non-null  int64  
 1   Alpha          118218 non-null  object 
 2   Delta          118218 non-null  object 
 3   Vmag           118217 non-null  float64
 4   B-V            117955 non-null  float64
 5   V-I            117955 non-null  float64
 6   Plx            116937 non-null  float64
 7   e_Plx          116943 non-null  float64
 8   Var_period     2541 non-null    float64
 9   Var_type       118218 non-null  object 
 10  Spectral_type  118218 non-null  object 
dtypes: float64(6), int64(1), object(4)
memory usage: 9.9+ MB
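
To make the effect of errors = 'coerce' concrete, here is a minimal sketch on a small, made-up Series (the values are hypothetical and not read from hip_sp.csv). Entries that cannot be parsed as numbers are replaced with NaN instead of raising an exception.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, mixed with junk entries
raw = pd.Series(['9.10', '3.54', 'n/a', '--', '12.31'])

clean = pd.to_numeric(raw, errors='coerce')  # unparseable entries become NaN
n_missing = int(clean.isna().sum())          # how many entries were coerced
print(clean.dtype, n_missing)
```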


### Calculating the absolute stellar magnitudes

In [3]:
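def find_absolute_mag(df):
    #Mv = Vmag + 5 - 5*log10(d), where d = 1000/Plx is the distance in parsecs
    #negative parallaxes yield NaN; suppress that warning only inside this block
    with np.errstate(invalid='ignore'):
        df['Mv'] = df['Vmag'] + 5 - 5*np.log10(1000/df['Plx'])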

In [4]:
find_absolute_mag(hip_sp1)

#drop rows with NaN values in the Mv column

hip_sp = hip_sp1.dropna(subset = ['Mv'])
hip_sp.head(5)  

Unnamed: 0,Hip_No,Alpha,Delta,Vmag,B-V,V-I,Plx,e_Plx,Var_period,Var_type,Spectral_type,Mv
0,1,00 00 00.22,+01 05 20.4,9.1,3.54,1.39,0.482,0.55,,,F5,-2.484765
1,2,00 00 00.91,-19 29 55.8,9.27,21.9,3.1,0.999,1.04,,C,K3V,-0.732173
3,4,00 00 02.01,-51 53 36.8,8.06,7.75,0.97,0.37,0.43,,,F0V,-4.098991
4,5,00 00 02.39,-40 35 28.4,8.55,2.87,1.11,0.902,0.9,,,G8III,-1.673967
5,6,00 00 04.35,+03 56 47.4,12.31,18.8,4.99,1.336,1.55,,,M0V:,2.939032


### Pythonic vs. non-Pythonic code

How many stars from our hip_sp.csv file are more luminous than the Sun, knowing that the absolute magnitude of the Sun is 4.83? To answer this question, we count the entries in the Mv column of the hip_sp data frame that are below this value: all stars from the catalog with an absolute magnitude, Mv, less than 4.83 are more luminous than our Sun.

In [5]:
%%time

#Pythonic Way

star_list = [mag for mag in hip_sp['Mv'] if mag < 4.83]

print(len(star_list))

110041
Wall time: 13 ms
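
As an alternative to the list comprehension, the same count can be obtained with a boolean mask, which avoids a Python-level loop. The sketch below uses a small, made-up Series in place of hip_sp['Mv']; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical absolute magnitudes; the notebook itself uses hip_sp['Mv']
mv = pd.Series([-2.48, 5.10, 4.83, 2.94, 6.20])

SUN_MV = 4.83
n_loop = len([mag for mag in mv if mag < SUN_MV])  # list-comprehension count
n_mask = int((mv < SUN_MV).sum())                  # vectorized count via boolean mask
print(n_loop, n_mask)
```

Both approaches should give the same count; the mask version pushes the comparison into compiled code.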


### Examining runtime 

To select the most efficient code, we examine the runtime with IPython magic commands. The %timeit magic runs a statement many times and reports the mean and standard deviation per loop; the -r option sets the number of runs and the -n option sets the number of loops per run. In contrast, the %%time command measures a single execution of a cell, so its result is more easily affected by other activity on the computer.

In [6]:
import timeit

%timeit star_list = [mag for mag in hip_sp['Mv'] if mag < 4.83]

%timeit -r2 -n10 star_list = [mag for mag in hip_sp['Mv'] if mag < 4.83]

12.5 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.1 ms ± 2.54 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
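
Outside a notebook, a similar measurement can be made with the timeit module directly. The sketch below is an approximation of %timeit -r2 -n10, not its exact implementation; the workload is a made-up list rather than the catalog data.

```python
import timeit

# Hypothetical workload: filter a list of 10,000 numbers
setup = "data = list(range(10_000))"
stmt = "[x for x in data if x < 5_000]"

# repeat=2 and number=10 mirror the -r2 -n10 options of the %timeit magic
timings = timeit.repeat(stmt, setup=setup, repeat=2, number=10)
best_per_loop = min(timings) / 10   # seconds per single execution
print(f"{best_per_loop * 1e6:.1f} µs per loop (best of 2 runs)")
```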


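For example, we can compare the time it takes to build a list with the square-bracket syntax [] and with Python's built-in function list(). Note that the two statements below are not equivalent: [hip_sp['Mv']] creates a one-element list that holds the whole Series, whereas list(hip_sp['Mv']) copies every value into a new list, which explains the large difference in timing.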

In [7]:
%timeit -r2 -n10 Mv_list1 = [hip_sp['Mv']]

%timeit -r2 -n10 Mv_list2 = list(hip_sp['Mv'])

5.97 µs ± 3 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
9.74 ms ± 118 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
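
The difference between wrapping an object in brackets and converting it with list() can be seen on a small stand-in. This sketch uses a range object in place of the pandas Series; the variable names are assumptions for illustration.

```python
values = range(5)          # stand-in for a pandas Series such as hip_sp['Mv']

wrapped = [values]         # one-element list that holds the original object
copied = list(values)      # new list with every element copied out
unpacked = [*values]       # same result as list(values), using unpacking

print(len(wrapped), len(copied), len(unpacked))
```

Only the last two forms do comparable work, so they are the fair pair to time against each other.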


### List of Hipparcos numbers for different stars

Let's create a list of Hipparcos IDs and an indexed list of absolute magnitudes using Python's built-in functions list(), range() and enumerate().

In [8]:
%%time
hip_id_list = list(hip_sp['Hip_No'])

hip_id_list1 = [* range(1, hip_id_list[-1])]
print(len(hip_id_list1))

118320
Wall time: 11 ms


In [9]:
%%time
mag_list = list(hip_sp['Mv'])

indexed_list = [* enumerate(mag_list, 1)]
print(indexed_list[0])

(1, -2.4847648088057515)
Wall time: 24 ms


### Rounding values using DataFrames

In [10]:
%%time

hip_sp2 = hip_sp.round({'Mv': 2})
print(hip_sp2.head(5))

   Hip_No        Alpha        Delta   Vmag    B-V   V-I    Plx  e_Plx  \
0       1  00 00 00.22  +01 05 20.4   9.10   3.54  1.39  0.482   0.55   
1       2  00 00 00.91  -19 29 55.8   9.27  21.90  3.10  0.999   1.04   
3       4  00 00 02.01  -51 53 36.8   8.06   7.75  0.97  0.370   0.43   
4       5  00 00 02.39  -40 35 28.4   8.55   2.87  1.11  0.902   0.90   
5       6  00 00 04.35  +03 56 47.4  12.31  18.80  4.99  1.336   1.55   

   Var_period Var_type Spectral_type    Mv  
0         NaN           F5           -2.48  
1         NaN        C  K3V          -0.73  
3         NaN           F0V          -4.10  
4         NaN           G8III        -1.67  
5         NaN           M0V:          2.94  
Wall time: 19 ms


In [11]:
%%time

Mv_list = round(hip_sp['Mv'], 2)
print(Mv_list[0:5])

0   -2.48
1   -0.73
3   -4.10
4   -1.67
5    2.94
Name: Mv, dtype: float64
Wall time: 14 ms


### Using NumPy array 

NumPy arrays apply a calculation to a whole set of numbers at once (vectorization), which is usually far more efficient than looping over the values in Python.

In [12]:
%%time
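#right ascension values from 1 to 359
#note: np.cos and np.sin interpret their input as radians, so values in degrees would first need np.deg2rad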

alpha_list = [*range(1,360,1)]
alpha_np = np.array(alpha_list)
alpha_np_c = np.cos(alpha_np)*np.sin(alpha_np)
print(alpha_np_c[0:10])

[ 0.45464871 -0.37840125 -0.13970775  0.49467912 -0.27201056 -0.26828646
  0.49530368 -0.14395166 -0.37549362  0.45647263]
Wall time: 1e+03 µs
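
NumPy's trigonometric functions expect radians, so angles given in degrees need a conversion first. The following is a minimal sketch with made-up angles, not the notebook's own array.

```python
import numpy as np

alpha_deg = np.array([0.0, 45.0, 90.0])   # hypothetical angles in degrees
alpha_rad = np.deg2rad(alpha_deg)         # convert before calling trig functions

# Element-wise product computed for the whole array at once
product = np.cos(alpha_rad) * np.sin(alpha_rad)
print(np.round(product, 6))
```

At 45 degrees the product of cosine and sine is one half, which gives a quick sanity check on the conversion.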


### Combining objects

We will combine the list of stellar absolute magnitudes with the list of stellar spectral types and look for the most efficient way of combining two objects. Using the built-in zip() function is more efficient than building the pairs in an explicit for loop.

In [13]:
%%time

Mv_list = hip_sp['Mv']
Sp_list = hip_sp['Spectral_type']

star_infos_zip = zip(Mv_list, Sp_list)
star_infos_zip_list = [* star_infos_zip]

print(type(star_infos_zip_list))
print(star_infos_zip_list[0:3])

<class 'list'>
[(-2.4847648088057515, 'F5          '), (-0.7321725588700883, 'K3V         '), (-4.098991379665025, 'F0V         ')]
Wall time: 28 ms
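
For comparison, the sketch below builds the same pairs with an index-based loop and with zip(). The short lists are made-up stand-ins for the Mv and Spectral_type columns.

```python
mags = [-2.48, -0.73, -4.10]          # hypothetical absolute magnitudes
types = ['F5', 'K3V', 'F0V']          # hypothetical spectral types

# Index-based loop: one lookup per list on every iteration
pairs_loop = []
for i in range(len(mags)):
    pairs_loop.append((mags[i], types[i]))

# zip pairs the elements lazily; unpacking materialises the list
pairs_zip = [*zip(mags, types)]
print(pairs_zip == pairs_loop)
```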


### Counting and grouping in Python

In this step, we will find the fastest way to count how many stars belong to each spectral type. We have 110043 stars from the Hipparcos catalog with a computed absolute magnitude. First, we count with an explicit loop; then we count with Counter from the collections module, a dictionary subclass designed for counting hashable objects.

In [14]:
%%time

#Counting using loop

Sp_list = hip_sp['Spectral_type']

spectral_groups = {}
for spectral_type in Sp_list:
    if spectral_type not in spectral_groups:
        spectral_groups[spectral_type] = 1
    else:
        spectral_groups[spectral_type] += 1

#printing first three spectral groups
print(list(spectral_groups.items())[:3])

[('F5          ', 3852), ('K3V         ', 214), ('F0V         ', 914)]
Wall time: 29 ms


In [15]:
%%time

#Counting using counter 

Sp_list = hip_sp['Spectral_type']

from collections import Counter

#create instance of counter
counter_dict = Counter(Sp_list)

#printing first three spectral groups
print(list(counter_dict.items())[:3])
#printing the most common spectral groups 
print(counter_dict.most_common(3))

[('F5          ', 3852), ('K3V         ', 214), ('F0V         ', 914)]
[('K0          ', 8538), ('G5          ', 6008), ('F8          ', 4358)]
Wall time: 14.4 ms
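
The spectral types in the output above carry trailing spaces. One possible refinement, sketched here on made-up labels, is to strip the padding while counting so that the keys are clean strings.

```python
from collections import Counter

# Hypothetical padded labels, similar in shape to the Spectral_type column
labels = ['K0   ', 'G5   ', 'K0   ', 'F8   ', 'K0   ', 'G5   ']

# A generator expression strips each label as Counter consumes it
counts = Counter(label.strip() for label in labels)
top_two = counts.most_common(2)
print(top_two)
```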


### Finding common stars between two lists

What is the best way to find the objects common to two lists? Converting the lists to Python's built-in set type and intersecting them is much faster than searching through the two lists element by element. Let's find the spectral types that appear in both parts of the Hipparcos catalog.

In [16]:
%%time

list_1 = (hip_sp['Spectral_type'][:50000])
list_2 = (hip_sp['Spectral_type'][50001:])

set_1 = set(list_1)
set_2 = set(list_2)

common_stars = set_1.intersection(set_2)
print(len(common_stars))

1255
Wall time: 7.96 ms


### Finding difference and union between two lists

With the two sets from the previous step, we can, for example, find the spectral types that exist only in the first list but not in the second, or, in one line, extract all spectral types from the two lists without having to repeat the types they share.

In [17]:
%%time

diff_list = set_1.difference(set_2)
print(len(diff_list))

unique_list = set_1.union(set_2)
print(len(unique_list))

1010
3729
Wall time: 1 ms
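
The three set operations can be checked by hand on a small example. The two lists below are made up for illustration and are much shorter than the catalog slices.

```python
first = ['G2V', 'K0', 'F5', 'K0']     # hypothetical spectral types, first part
second = ['K0', 'M0V', 'F5']          # hypothetical spectral types, second part

set_a, set_b = set(first), set(second)

common = sorted(set_a & set_b)        # in both lists (intersection)
only_first = sorted(set_a - set_b)    # only in the first list (difference)
all_types = sorted(set_a | set_b)     # every distinct type (union)
print(common, only_first, all_types)
```

The operators &, - and | are equivalent to the intersection(), difference() and union() methods when both operands are sets.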


### Finding an element in a list

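What is the fastest way to search for an element in a collection built from 110043 objects? We will show below that a membership test on a set is faster than on an ordinary list or a tuple. A set uses hashing, so the lookup time does not depend on where the element sits, while a list or a tuple is scanned from the start; the list and tuple timings below are small presumably because 'A2' appears early in the sequence.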

In [18]:
list_1 = list(hip_sp['Spectral_type'])
 
new_list = [i.strip(' ') for i in list_1]
print(type(new_list))

%timeit 'A2' in new_list

<class 'list'>
161 ns ± 1.77 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [19]:
new_object = tuple([i.strip(' ') for i in list_1])
print(type(new_object))

%timeit 'A2' in new_object

<class 'tuple'>
164 ns ± 9.84 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [20]:
new_object = set([i.strip(' ') for i in list_1])
print(type(new_object))

%timeit 'A2' in new_object

<class 'set'>
45.5 ns ± 3.69 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
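
The gap between a list and a set becomes much larger when the target sits near the end of the list. The sketch below constructs such a worst case with made-up labels; the sizes and repeat counts are arbitrary choices.

```python
import timeit

n = 100_000
as_list = [str(i) for i in range(n)]   # hypothetical labels '0' ... '99999'
as_set = set(as_list)

target = str(n - 1)                    # last element: worst case for a list scan
t_list = timeit.timeit(lambda: target in as_list, number=50)
t_set = timeit.timeit(lambda: target in as_set, number=50)
print(f"list: {t_list:.4f} s, set: {t_set:.6f} s")
```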


### How to eliminate loops

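We will look at several ways to eliminate explicit loops, which tend to be slower and take more lines of code than the alternatives. For example, we can select all stars with the same spectral type as our Sun, G2V. Note that the loop below matches every spectral type that contains 'G2V', while the later cells require an exact match after stripping whitespace, so the counts differ (690 versus 568).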

In [21]:
%%time
#for loop approach

star_list = hip_sp[['Spectral_type', 'Mv']]

suntype_stars = []
for i,j in star_list.iterrows():
    if 'G2V' in j['Spectral_type']:
        suntype_stars.append(j['Mv'])

print('List of the absolute magnitudes for Sun-type stars:', suntype_stars[:2], '...')
print(len(suntype_stars))  

List of the absolute magnitudes for Sun-type stars: [-2.8359178440586437, -2.4620964391778735] ...
690
Wall time: 9.07 s


In [22]:
%%time
#boolean-mask approach (vectorized string comparison)

from statistics import mean
df_sp = hip_sp[['Spectral_type', 'Mv']]

df_sunlike = df_sp[df_sp['Spectral_type'].str.strip() == 'G2V']
print(len(df_sunlike))

sunlike_avg = mean(df_sunlike['Mv'])
print('The average absolute magnitude of the sun-like stars:', 
      round(sunlike_avg, 2))

568
The average absolute magnitude of the sun-like stars: -2.59
Wall time: 49.6 ms


In [23]:
%%time
#NumPy approach

numpy_sunlike= np.array(df_sunlike['Mv'])
print(len(numpy_sunlike))

sunlike_avg = numpy_sunlike.mean()
print('The average absolute magnitude of the sun-like stars:', 
      round(sunlike_avg, 2))

568
The average absolute magnitude of the sun-like stars: -2.59
Wall time: 0 ns
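
The difference between an exact match and a substring match is easy to see on a few made-up rows. The following sketch assumes padded labels like those in the catalog; the data values are hypothetical.

```python
import pandas as pd

# Hypothetical rows with padded spectral types
df = pd.DataFrame({
    'Spectral_type': ['G2V   ', 'G2V+K0', 'K0    ', 'G2V   '],
    'Mv': [4.8, 5.1, 6.0, 4.6],
})

# Exact match after stripping the padding
exact = df[df['Spectral_type'].str.strip() == 'G2V']
# Substring match: also selects composite types such as 'G2V+K0'
partial = df[df['Spectral_type'].str.contains('G2V', regex=False)]

exact_mean = exact['Mv'].mean()
print(len(exact), len(partial), round(exact_mean, 2))
```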


### Using tuples and Python's built-in functions

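By combining tuples with Python's built-in functions such as zip() and map(), we can move work out of the loop body and therefore make the loop more efficient. Below, the loop only collects the tuples produced by zip(), and the conversion of each tuple to a list is done afterwards in a single map() call.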

In [24]:
%%time
sp_list = df_sunlike['Spectral_type']
mv_list = df_sunlike['Mv']

result_tuple = []
for star in zip(sp_list, mv_list):
    result_tuple.append(star)
    
result = [*map(list, result_tuple)]  
print(result[:2])

[['G2V         ', -2.8359178440586437], ['G2V         ', -2.4620964391778735]]
Wall time: 999 µs


###  Iterating over DataFrames

If we want to add a new column containing the distance to each of the Hipparcos stars, we can apply the following formula:
$$ d = 10^{0.2\,(V_{mag} - M_v + 5)} $$
where $M_v$ is the absolute stellar magnitude, $V_{mag}$ is the apparent stellar magnitude, and $d$ is the distance to the star in parsecs.
There are several ways to iterate through a DataFrame; which one is the most efficient, especially for large datasets?

In [25]:
#Function for calculating stellar distance

def calc_stellar_distance(mag_ap, mag_ab):
    dis = 10 **((mag_ap - mag_ab + 5)*0.2)
    return np.round(dis, 2)

In [26]:
%%time

#iterating with .iloc

new_hip_1 = hip_sp.copy()

distance_list = []

for i in range(len(new_hip_1)):
    row = new_hip_1.iloc[i]
    apparent_mag = row['Vmag']
    absolute_mag = row['Mv']
    distance = calc_stellar_distance(apparent_mag, absolute_mag)
    distance_list.append(distance)
    
new_hip_1['Distance'] = distance_list  
print('Number of rows and columns in data frame:', new_hip_1.shape)

Number of rows and columns in data frame: (110043, 13)
Wall time: 18.3 s


In [27]:
print(new_hip_1.head(5))

   Hip_No        Alpha        Delta   Vmag    B-V   V-I    Plx  e_Plx  \
0       1  00 00 00.22  +01 05 20.4   9.10   3.54  1.39  0.482   0.55   
1       2  00 00 00.91  -19 29 55.8   9.27  21.90  3.10  0.999   1.04   
3       4  00 00 02.01  -51 53 36.8   8.06   7.75  0.97  0.370   0.43   
4       5  00 00 02.39  -40 35 28.4   8.55   2.87  1.11  0.902   0.90   
5       6  00 00 04.35  +03 56 47.4  12.31  18.80  4.99  1.336   1.55   

   Var_period Var_type Spectral_type        Mv  Distance  
0         NaN           F5           -2.484765   2074.69  
1         NaN        C  K3V          -0.732173   1001.00  
3         NaN           F0V          -4.098991   2702.70  
4         NaN           G8III        -1.673967   1108.65  
5         NaN           M0V:          2.939032    748.50  


In [28]:
%%time

#iterating with .iterrows

new_hip_2 = hip_sp.copy()
distance_list = []

for i,row in new_hip_2.iterrows():
    apparent_mag = row['Vmag']
    absolute_mag = row['Mv']
    distance = calc_stellar_distance(apparent_mag, absolute_mag)
    distance_list.append(distance)
    
new_hip_2['Distance'] = distance_list  
print('Number of rows and columns in data frame:', new_hip_2.shape)

Number of rows and columns in data frame: (110043, 13)
Wall time: 10.2 s


In [29]:
print(new_hip_2.head(5))

   Hip_No        Alpha        Delta   Vmag    B-V   V-I    Plx  e_Plx  \
0       1  00 00 00.22  +01 05 20.4   9.10   3.54  1.39  0.482   0.55   
1       2  00 00 00.91  -19 29 55.8   9.27  21.90  3.10  0.999   1.04   
3       4  00 00 02.01  -51 53 36.8   8.06   7.75  0.97  0.370   0.43   
4       5  00 00 02.39  -40 35 28.4   8.55   2.87  1.11  0.902   0.90   
5       6  00 00 04.35  +03 56 47.4  12.31  18.80  4.99  1.336   1.55   

   Var_period Var_type Spectral_type        Mv  Distance  
0         NaN           F5           -2.484765   2074.69  
1         NaN        C  K3V          -0.732173   1001.00  
3         NaN           F0V          -4.098991   2702.70  
4         NaN           G8III        -1.673967   1108.65  
5         NaN           M0V:          2.939032    748.50  


In [30]:
%%time

#iterating with .itertuples

new_hip_3 = hip_sp.copy()
distance_list = []

for row in new_hip_3.itertuples(name = None):
    #row[0] is the index, so Vmag (4th column) is row[4] and Mv (12th column) is row[12]
    apparent_mag = row[4]
    absolute_mag = row[12]
    distance = calc_stellar_distance(apparent_mag, absolute_mag)
    distance_list.append(distance)
    
new_hip_3['Distance'] = distance_list 
print('Number of rows and columns in data frame:', new_hip_3.shape)

Number of rows and columns in data frame: (110043, 13)

In [31]:
print(new_hip_3.head(5))

   Hip_No        Alpha        Delta   Vmag    B-V   V-I    Plx  e_Plx  \
0       1  00 00 00.22  +01 05 20.4   9.10   3.54  1.39  0.482   0.55   
1       2  00 00 00.91  -19 29 55.8   9.27  21.90  3.10  0.999   1.04   
3       4  00 00 02.01  -51 53 36.8   8.06   7.75  0.97  0.370   0.43   
4       5  00 00 02.39  -40 35 28.4   8.55   2.87  1.11  0.902   0.90   
5       6  00 00 04.35  +03 56 47.4  12.31  18.80  4.99  1.336   1.55   

   Var_period Var_type Spectral_type        Mv  Distance  
0         NaN           F5           -2.484765   2074.69  
1         NaN        C  K3V          -0.732173   1001.00  
3         NaN           F0V          -4.098991   2702.70  
4         NaN           G8III        -1.673967   1108.65  
5         NaN           M0V:          2.939032    748.50  


Of the row-iteration methods shown here, .itertuples() with the name parameter set to None is generally the fastest, because it yields plain tuples instead of building a Series for every row.
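
With name = None each row is a plain tuple whose first element is the index, so column values must be read by position. One way to avoid hard-coded positions, sketched here on a made-up two-column frame, is to look them up from the column names.

```python
import pandas as pd

# Hypothetical two-column frame; the positions differ in the full catalog
df = pd.DataFrame({'Vmag': [9.10, 9.27], 'Mv': [-2.48, -0.73]})

# Look up positions by name; +1 because element 0 of each tuple is the index
i_vmag = df.columns.get_loc('Vmag') + 1
i_mv = df.columns.get_loc('Mv') + 1

diffs = [row[i_vmag] - row[i_mv] for row in df.itertuples(name=None)]
print(diffs)
```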

In [32]:
%%time

#using pandas apply method

new_hip_4 = hip_sp.copy()

df = new_hip_4.apply(
         lambda row: calc_stellar_distance(row['Vmag'], row['Mv']), axis = 1)
    
new_hip_4['Distance'] = df
print('Number of rows and columns in data frame:', new_hip_4.shape)

Number of rows and columns in data frame: (110043, 13)
Wall time: 2.76 s


In [33]:
print(new_hip_4.head(5))

   Hip_No        Alpha        Delta   Vmag    B-V   V-I    Plx  e_Plx  \
0       1  00 00 00.22  +01 05 20.4   9.10   3.54  1.39  0.482   0.55   
1       2  00 00 00.91  -19 29 55.8   9.27  21.90  3.10  0.999   1.04   
3       4  00 00 02.01  -51 53 36.8   8.06   7.75  0.97  0.370   0.43   
4       5  00 00 02.39  -40 35 28.4   8.55   2.87  1.11  0.902   0.90   
5       6  00 00 04.35  +03 56 47.4  12.31  18.80  4.99  1.336   1.55   

   Var_period Var_type Spectral_type        Mv  Distance  
0         NaN           F5           -2.484765   2074.69  
1         NaN        C  K3V          -0.732173   1001.00  
3         NaN           F0V          -4.098991   2702.70  
4         NaN           G8III        -1.673967   1108.65  
5         NaN           M0V:          2.939032    748.50  
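
All of the approaches above still call the distance function once per row. Because the formula only uses arithmetic that NumPy and pandas support on whole columns, it can also be evaluated in a single vectorized expression. The sketch below uses a small, made-up frame rather than hip_sp, and no timing is claimed for it here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be a copy of hip_sp
stars = pd.DataFrame({'Vmag': [7.0, 10.0], 'Mv': [2.0, 0.0]})

# Whole-column arithmetic: no Python-level loop and no per-row function call
stars['Distance'] = np.round(10 ** ((stars['Vmag'] - stars['Mv'] + 5) * 0.2), 2)
print(stars)
```

The expression is the same formula as in calc_stellar_distance, applied to Series instead of scalars.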
