# Efficient Looping with Itertools

`itertools` is a module in Python's standard library, so it comes with every Python installation. It contains functions that create iterators for efficient looping.

## Iterators and Iterables

**Iterable**<br>
- An object capable of returning its members one at a time i.e. anything that can be looped over.
- Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict, file objects.
- You can identify an iterable object by the presence of the `__iter__` method.

In [None]:
nums = [10, 20, 30]

In [None]:
for i in nums:
    print(i)

In [None]:
print(dir(nums))

In [None]:
'__iter__' in dir(nums)

**Iterator**<br>
- An object with a state, that remembers what the current state is, and knows how to get the next state or value.
- You can identify an iterator by the presence of the `__next__` method.
- You can create an iterator for an iterable by calling the `__iter__` method.

In [None]:
i_nums = nums.__iter__() # iter(nums) does the same thing

In [None]:
print(i_nums)

In [None]:
print(dir(i_nums))

In [None]:
'__next__' in dir(i_nums)

In [None]:
print(next(i_nums)) # same as: print(i_nums.__next__())

In [None]:
print(next(i_nums)) 

In [None]:
print(next(i_nums)) 

In [None]:
print(next(i_nums)) 

The `StopIteration` exception tells us that the iterator has run through all the members of the list it was iterating through.

**New perspective on for loops**<br>
Behind the scenes, a `for` loop creates an iterator and calls `next()` on it repeatedly to get the next value (or state), until it runs into the `StopIteration` exception, at which point it exits the loop.

In [None]:
for i in nums:
    print(i)
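
As a rough sketch of what the `for` loop does behind the scenes, the same loop can be written by hand with `iter()`, `next()`, and a `try`/`except`. The names `iterator` and `seen` are only illustrative, and `nums` is redefined so the snippet runs on its own.

```python
nums = [10, 20, 30]  # same list as above, redefined so the sketch is self-contained

seen = []                       # illustrative: records the values we visit
iterator = iter(nums)           # step 1: get an iterator from the iterable
while True:
    try:
        value = next(iterator)  # step 2: ask for the next value
    except StopIteration:       # step 3: leave the loop once the iterator is exhausted
        break
    seen.append(value)
    print(value)
```

This is only an approximation of the real mechanism, but it shows why `StopIteration` never surfaces when you use a `for` loop: the loop catches it for you.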

### Exercise 1
(a) Is the string 'hello' an iterable? <br>
(b) How can you know this using Python code? <br>
(c) Create an iterator for 'hello' <br>
(d) Is my_set {1, 2, 3} an iterable? Why or why not?

In [None]:
string = "hello"
my_set = {1, 2, 3}

In [None]:
# your code here



## Itertools functions
According to the itertools docs, it is a “module [that] implements a number of iterator building blocks inspired by constructs from APL, Haskell, and SML… Together, they form an ‘iterator algebra’ making it possible to construct specialized tools succinctly and efficiently in pure Python.”

Loosely speaking, this means that the functions in itertools “operate” on iterators to produce more complex iterators. Consider, for example, the built-in `zip()` function, which takes any number of iterables as arguments and returns an iterator over tuples of their corresponding elements:

In [None]:
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']

zip(numbers, letters)

In [None]:
list(zip(numbers, letters))

How, exactly, does `zip()` work?

- `[1, 2, 3]` and `['a', 'b', 'c']`, like all lists, are iterable, which means they can return their elements one at a time. <br>
- Under the hood, the `zip()` function works, in essence, by calling `iter()` on each of its arguments, then advancing each iterator returned by `iter()` with `next()` and aggregating the results into tuples. The iterator returned by `zip()` iterates over these tuples. <br><br>
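Note that iterators are themselves _iterable_: calling `iter()` on an iterator returns the iterator itself. This is what lets you pass the output of one function such as `zip()` into another, making it possible to do complex looping operations with them.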

In [None]:
'__iter__' in dir(i_nums)
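
To make the description of `zip()` above concrete, here is a minimal sketch of the same idea written as a generator function. `simple_zip` is a hypothetical helper for illustration only; it is not how the built-in is actually implemented.

```python
def simple_zip(*iterables):
    """Simplified stand-in for zip(); a hypothetical helper, not the real implementation."""
    iterators = [iter(it) for it in iterables]  # call iter() on each argument
    if not iterators:
        return
    while True:
        row = []
        for it in iterators:
            try:
                row.append(next(it))  # advance each iterator by one step
            except StopIteration:
                return                # stop as soon as the shortest input runs out
        yield tuple(row)              # aggregate the results into a tuple

pairs = list(simple_zip([1, 2, 3], ['a', 'b', 'c']))
print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Like the built-in, this sketch stops at the shortest input, so `simple_zip([1, 2], [9])` produces a single pair.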

**Why should you use iterators?** <br>
There are two main reasons why such an “iterator algebra” is useful:
- improved memory efficiency (via lazy evaluation)
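- often faster execution time (in CPython, the itertools functions are implemented in C and avoid building intermediate lists).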

Essentially, an iterator only needs to store its current state in memory and know how to get the next state or value. 
Suppose that in the `zip` example above, both of your lists had 100 million elements. Building a list of all 100 million pairs up front would take a lot of additional memory. The iterator returned by `zip()` instead produces one pair at a time, saving immensely on memory resources.
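
One rough way to see the memory difference is to compare a fully built list with a lazy generator using `sys.getsizeof`. This is only a sketch: the variable names are illustrative, the exact sizes depend on your Python version, and `getsizeof` reports only the container itself, not the objects it refers to.

```python
import sys

n = 1_000_000
squares_list = [x * x for x in range(n)]  # every value is stored at once
squares_iter = (x * x for x in range(n))  # values are produced on demand

list_size = sys.getsizeof(squares_list)   # grows with n
iter_size = sys.getsizeof(squares_iter)   # small and independent of n
print(list_size, iter_size)

# Both give the same result when consumed
print(sum(squares_iter) == sum(squares_list))
```

The trade-off is that the generator can only be consumed once, whereas the list can be reused.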

Today we will cover some widely used itertools functions. There are many more that you can check out at https://docs.python.org/3/library/itertools.html

In [None]:
import itertools

In [None]:
print(dir(itertools)) # your version might have some newer functions as well

## Types of iterators
- Infinite iterators
- Finite iterators
- Combinatoric iterators

### Infinite iterators
Let's start with infinite iterators, the generators that never say "stop." They just keep going and going! In this section, we'll focus on three basic but powerful infinite iterators: `count`, `cycle`, and `repeat`. Each has its own unique utility and can make your data science tasks more efficient.

#### count()

`count` creates an infinite iterator that generates consecutive numbers. In data science, this function can be used for adding indexes or time-stamps to a data set.

In [None]:
# Start counting from 1, incrementing by 1 indefinitely
counter = itertools.count(start=1)
print(next(counter))
print(next(counter))
print(next(counter))
print(next(counter))

In [None]:
# Simulated data points for weight measurements
weight_data = [70, 71, 69, 68, 70]

# Create a dictionary with unique ID numbers for each weight
indexed_data = {}
for index, weight in zip(counter, weight_data):
    indexed_data[index] = weight
print(indexed_data)

Why did the indexing start from 5?
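
Separately from the question above, `count` also accepts a `step` argument. The sketch below uses a fresh counter and `itertools.islice` to take a fixed number of values from it; the names `evens` and `first_five` are illustrative.

```python
import itertools

# count() accepts a start value and a step
evens = itertools.count(start=0, step=2)

# islice takes a finite slice out of the infinite iterator
first_five = list(itertools.islice(evens, 5))
print(first_five)  # [0, 2, 4, 6, 8]
```

Using `islice` is one common way to keep an infinite iterator from running forever.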

#### cycle()

`cycle` creates an infinite iterator that loops over an input sequence indefinitely. In data science, this can be useful for tasks that require periodic or cyclical patterns. 

In [None]:
# Create a cycle of days of the week
days_of_week = itertools.cycle(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Simulated temperature data for several days
temperatures = [23, 25, 22, 21, 20, 19, 24, 22, 18, 17, 19, 23]

# (day, temperature) pairs; zip stops when temperatures runs out
weather_report = list(zip(days_of_week, temperatures))
print(weather_report)
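
In the example above, `zip` ends the loop because `temperatures` is finite. When there is no finite partner to pair with, one option is `itertools.islice`. A small sketch with an illustrative `colors` cycle:

```python
import itertools

colors = itertools.cycle(['red', 'green', 'blue'])

# Take only the first 7 values from the endless cycle
first_seven = list(itertools.islice(colors, 7))
print(first_seven)
# ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']
```

Calling `list(colors)` directly would never finish, since the cycle has no end.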

#### repeat()

`repeat` creates an iterator that produces a specified value indefinitely or for a given number of times. In data science, repeat() can be useful when you need to fill or extend data with a constant value. Imagine you're tracking monthly sales data and you want to project the same revenue for the next few months as a placeholder.

In [None]:
# Existing revenue data for past months
past_revenue = [4000, 4200, 4500]

# Use repeat to generate the same revenue for the next 3 months
revenue_projection = list(itertools.repeat(5000, 3))

# Combine past revenue and future projections
total_revenue = past_revenue + revenue_projection

print(total_revenue)
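
When no count is given, `repeat` is infinite, which makes it handy for supplying a constant argument to `map`. The sketch below squares each value; the list `values` is illustrative.

```python
import itertools

values = [1, 2, 3, 4]

# repeat(2) supplies the exponent for every element;
# map stops when the shortest input (values) is exhausted
squares = list(map(pow, values, itertools.repeat(2)))
print(squares)  # [1, 4, 9, 16]
```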

### Exercise 2
You have a new drug treatment you want to test in clinical trials. You have 3 categories of treatment: 'placebo', 'drug_5mg' , 'drug_10mg'. You don't know how many patients will register for this trial but you would like equal numbers of patients in each group. How can you use itertools to assign patients to treatment categories?

In [None]:
treatment_categories = ['placebo', 'drug_5mg', 'drug_10mg']
patient_list = ['Anna', 'John', 'Linzang', 'Corey', 'Mayuri', 'Ezra', 'Magneto']

# your code here



### Finite iterators
These are tools that deal with tasks having a definite endpoint.

#### chain()
`chain()` creates an iterator that links multiple sequences together. In data science, this can be useful for combining disparate data sets into a single sequence for easier analysis. Suppose you have quarterly sales data stored in different lists, and you want to analyze the sales data for the entire year.

In [None]:
# Quarterly sales data for a year
Q1_sales = [1000, 1100, 1050]
Q2_sales = [1200, 1300, 1250]
Q3_sales = [1100, 1000, 1150]
Q4_sales = [1050, 1100, 1200]

# Use chain to combine all the sales data
annual_sales = list(itertools.chain(Q1_sales, Q2_sales, Q3_sales, Q4_sales))

print(annual_sales)

`chain` yields the elements of the first iterable until it is exhausted, then the elements of the second one, and so on through the last.
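
If the sequences are already collected in a single list of lists, `chain.from_iterable` does the same job without unpacking them into separate arguments. A short sketch with illustrative data:

```python
import itertools

# Hypothetical nested data: one inner list per quarter
quarters = [[1000, 1100, 1050], [1200, 1300, 1250]]

flat = list(itertools.chain.from_iterable(quarters))
print(flat)  # [1000, 1100, 1050, 1200, 1300, 1250]
```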

#### compress()
`compress()` produces an iterator that filters elements from an input sequence based on a second iterable of selector values: an element is kept when the corresponding selector is truthy. In this sense, it is similar to `filter`, except that `filter` decides by calling a function on each element, whereas `compress` reads flags that have already been computed. <br><br>
Let’s say you have a list of product reviews and a corresponding list that marks each review as either relevant or not relevant.

In [None]:
# Product reviews
reviews = ["Great!", "Bad!", "Average.", "Excellent!", "Poor!"]

# Relevance flags: 1 for relevant, 0 for irrelevant
relevance_flags = [1, 0, 0, 1, 0]

# Use compress to keep only relevant reviews
relevant_reviews = list(itertools.compress(reviews, relevance_flags))

print(relevant_reviews)
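
To see what `compress` is doing, it can help to compare it with an equivalent list comprehension over `zip`. This is only a sketch; the data is repeated so the snippet runs on its own, and `with_compress` and `with_zip` are illustrative names.

```python
import itertools

reviews = ["Great!", "Bad!", "Average.", "Excellent!", "Poor!"]
relevance_flags = [1, 0, 0, 1, 0]

with_compress = list(itertools.compress(reviews, relevance_flags))

# Same selection written by hand: keep a review when its flag is truthy
with_zip = [review for review, flag in zip(reviews, relevance_flags) if flag]

print(with_compress == with_zip)  # True
```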

#### dropwhile()
`dropwhile(func, seq)` creates an iterator that drops elements from the start of an input sequence as long as a given condition is true, then yields every remaining element. <br>
In data science, this function is useful for skipping a leading segment of data that doesn't meet certain criteria.<br><br>
For instance, if you're analyzing website traffic and want to start from the first period of high activity, `dropwhile()` can help you skip the low-traffic hours at the beginning.

In [None]:
# Hourly website traffic counts
traffic_data = [10, 12, 8, 15, 20, 25, 30, 10, 5, 8]

# Function to check for low traffic
def is_low_traffic(x):
    return x < 15

# Use dropwhile to skip the leading low-traffic hours
high_traffic_data = list(itertools.dropwhile(is_low_traffic, traffic_data))

print(high_traffic_data)

In this code, `dropwhile()` uses the `is_low_traffic` function to skip over the first elements that are below 15. Once it encounters an element that is 15 or higher, it yields that element and every element after it, including later values below 15. The result is a new list that starts from the first high-traffic hour. This is handy in data science for trimming a leading segment of data that is not relevant to the analysis.
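
The difference between `dropwhile` and ordinary filtering is easy to miss, so here is a small side-by-side sketch. The data and predicate are repeated so the snippet is self-contained; `after_first_high` and `all_high` are illustrative names.

```python
import itertools

traffic_data = [10, 12, 8, 15, 20, 25, 30, 10, 5, 8]

def is_low_traffic(x):
    return x < 15

# dropwhile only skips the leading run of low values
after_first_high = list(itertools.dropwhile(is_low_traffic, traffic_data))

# Filtering tests every element, wherever it appears
all_high = [x for x in traffic_data if not is_low_traffic(x)]

print(after_first_high)  # [15, 20, 25, 30, 10, 5, 8]
print(all_high)          # [15, 20, 25, 30]
```

Which one you want depends on whether the position of the data matters (trimming a warm-up period) or only its value (selecting high-traffic hours).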

#### takewhile()
Its counterpart, `takewhile()`, yields items from the iterable until the specified predicate becomes false for the first time, and then stops.

In [None]:
# Use takewhile to keep only the leading low-traffic hours
low_traffic_data = list(itertools.takewhile(is_low_traffic, traffic_data))
print(low_traffic_data)

Why did `takewhile()` not return the other elements at the end of traffic_data that were smaller than 15?

### Exercise 3
You have a company's stock data in terms of percent growth per year. You want to find a streak of good performance when the company's stock growth was positive. 

In [None]:
# percent growth per year
value_list = [5, 6, -8, -4, 2]

# your code here
# Hint: define a function to test for positive growth in a given year
def is_positive(n):
    ...


result = list(itertools....(function, sequence))
print(result)

### Combinatoric iterators
These are special tools designed to perform combinations, permutations, and cross-products. They are extremely useful when we need to explore all possible scenarios or arrangements in data science tasks.

#### product()
`product()` creates an iterator that produces the Cartesian product of input iterables.<br>
In data science, this function is particularly useful for generating all possible combinations of different sets of parameters for model tuning.

Imagine you want to try out various combinations of learning rates and batch sizes to optimize a machine-learning model.

In [None]:
# Possible learning rates and batch sizes
learning_rates = [0.01, 0.1, 0.5]
batch_sizes = [32, 64, 128]

# Generate all combinations using product
parameter_combinations = list(itertools.product(learning_rates, batch_sizes))

print(parameter_combinations)

`product()` takes two lists: `learning_rates` and `batch_sizes`. It then creates an iterator that gives you all possible pairs between these two lists. The result, `parameter_combinations`, includes all possible combinations of learning rates and batch sizes.

This helps in running experiments to fine-tune machine learning models, allowing you to test each combination systematically.
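
One way to think about `product()` is as a replacement for nested `for` loops. The sketch below builds the same pairs both ways; the parameter lists are repeated so it runs on its own, and `nested` and `from_product` are illustrative names.

```python
import itertools

learning_rates = [0.01, 0.1, 0.5]
batch_sizes = [32, 64, 128]

# The nested-loop version: the inner loop varies fastest
nested = []
for lr in learning_rates:
    for bs in batch_sizes:
        nested.append((lr, bs))

from_product = list(itertools.product(learning_rates, batch_sizes))

print(nested == from_product)  # True
print(len(from_product))       # 9
```

With more than two parameter lists, `product()` keeps the code flat where the loop version would need one more level of nesting per list.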

#### permutations()
`permutations()` creates an iterator that produces all possible arrangements of an input iterable.<br> 

For example, say you have 5 political candidates, and you want to generate every possible scenario of a poll-result - where a President and a Vice-President are chosen. Here, the order matters since the first candidate will be president and the second candidate will be vice-president.

In [None]:
# Candidates
candidates = ['A', 'B', 'C', 'D', 'E']

# Generate all ordered (President, Vice-President) pairs
candidate_permutations = list(itertools.permutations(candidates, 2))

print(candidate_permutations)

`permutations()` takes the list `candidates` and the integer 2, which specifies the length of the permutations we want to generate. The output gives you all 2-element arrangements of the given candidates.
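
As a quick sanity check, the number of results should match the counting formula for permutations, 5 × 4 = 20. The sketch below assumes Python 3.8 or newer, where `math.perm` is available; `tickets` is an illustrative name.

```python
import itertools
import math

candidates = ['A', 'B', 'C', 'D', 'E']
tickets = list(itertools.permutations(candidates, 2))

print(len(tickets))     # 20
print(math.perm(5, 2))  # 20 (requires Python 3.8+)

# Order matters: ('A', 'B') and ('B', 'A') are different outcomes
print(('A', 'B') in tickets, ('B', 'A') in tickets)  # True True
```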

#### combinations()
`combinations()` creates an iterator over unique combinations of elements from an input iterable.<br>
In data science, you often face problems where you need to select a subset of features to feed into a model.

Imagine you have a dataset with columns like 'Age', 'Salary', and 'Years of Experience'. Using `combinations()`, you could generate all possible pairs or triplets of these columns to find the most predictive set.

In [None]:
# List of features in a dataset
features = ['Age', 'Salary', 'Years of Experience']

# Generate all 2-element combinations of features
feature_combinations = list(itertools.combinations(features, 2))

print(feature_combinations)

This code takes the list of features and generates all unique 2-element combinations. You could then use this output to train your model with different feature sets.

The idea is to find out which group of features makes your model the most accurate. This can help you improve model performance without adding extra complexity.
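
To try feature subsets of every size rather than only pairs, one possible approach is to loop over the subset size and collect the results. This is a sketch with illustrative names (`all_subsets`, `size`), not a complete feature-selection procedure.

```python
import itertools

features = ['Age', 'Salary', 'Years of Experience']

# Collect feature subsets of every size from 1 to len(features)
all_subsets = []
for size in range(1, len(features) + 1):
    all_subsets.extend(itertools.combinations(features, size))

print(len(all_subsets))  # 7  (3 singles + 3 pairs + 1 triple)

# Unlike permutations, order does not matter: only one of these appears
print(('Age', 'Salary') in all_subsets)  # True
print(('Salary', 'Age') in all_subsets)  # False
```

The number of subsets doubles with every added feature, so this brute-force search is only practical for small feature lists.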

### Exercise 4
You are building a machine learning model which takes in genetic sequences of length 3 and predicts how immunogenic the sequence is.
You want to start with the sequence below and test all possible sequences to find out the most immunogenic sequence in order to design your world-saving vaccine. How would you do this?

In [None]:
sequence = ['T', 'A', 'G', 'C']

# your code here


**Acknowledgements**: https://www.stratascratch.com/blog/using-python-itertools-for-efficient-looping/